Abstract. In this work we consider series estimators for the conditional mean in light
of three new ingredients: (i) sharp LLNs for matrices derived from the non-commutative
Khinchin inequalities, (ii) bounds on the Lebesgue constant that controls the ratio between
the $L^\infty$ and $L^2$ norms, and (iii) maximal inequalities with data-dependent bounds for
processes whose entropy integrals diverge at some rate.

These technical tools allow us to contribute to the series literature, specifically the
seminal work of Newey (1995), as follows. First, we weaken considerably the condition
on the number $k$ of approximating functions used in series estimation from the typical
$k^2/n \to 0$ to $k/n \to 0$, up to log factors, which was available only for splines before.
Second, under the same weak conditions we derive $L^2$ rates and pointwise central limit
theorems when the approximation error vanishes. Under an incorrectly specified
model, i.e. when the approximation error does not vanish, analogous results are also
shown. Third, under stronger conditions we derive uniform rates and functional central
limit theorems that hold whether or not the approximation error vanishes. That is, we derive the
strong approximation for the entire estimate of the non-parametric function. Finally, we
derive uniform rates and inference results for linear functionals of interest of the conditional
expectation function such as its partial derivative or conditional average partial derivative.



1. Introduction 

Series estimators play a central role in various fields. In econometric
applications it is common that the exact form of a conditional expectation is unknown, and
a flexible functional form can lead to improvements over a pre-specified functional
form. Series estimation offers exactly that by approximating the unknown function based
on $k$ basis functions, where $k$ is allowed to grow with the sample size $n$ to balance the
trade-off between variance and bias.

Several asymptotic properties of series estimators have been investigated in the literature.
The focus has been on convergence rates and asymptotic normality results (see
Andrews, 1991; Eastwood and Gallant, 1991; Gallant and Souza, 1991; Newey, 1997; and
the references therein).

This work revisits the topic by making use of three critical ingredients: 

i. The sharp LLNs for matrices derived from the non-commutative Khinchin inequal- 
ities. 

ii. The sharp bounds on the Lebesgue constant that controls the ratio between the $L^\infty$
and $L^2$ norms of the least squares approximation of functions (which is bounded or
grows like $\log k$ in many cases).

iii. Maximal inequalities with data-dependent bounds for processes whose entropy in- 
tegrals diverge at some rate. 

To the best of our knowledge, this is the first application of the first ingredient
to statistical estimation problems. The second ingredient has already been
used by Huang (2003a), but for splines only. The third ingredient was derived to allow for
weak moment conditions. All of these ingredients are critical for generating sharp results.

This approach allows us to contribute to the series literature in several directions. First,
we weaken considerably the condition on the number $k$ of approximating functions used in
series estimation from the typical $k^2/n \to 0$ (see Newey, 1997) to

$$k/n \to 0 \quad \text{(up to logs)}$$

for bounded bases, which was available only for splines before (Huang, 2003a; Stone, 1994).
Second, under the same weak conditions we derive $L^2$ rates and pointwise central limit
theorems when the approximation error vanishes. Under a misspecified model, i.e.
when the approximation error does not vanish, analogous results are also shown. Third,
under stronger conditions we derive uniform rates and functional central limit theorems that 
hold if the approximation error vanishes or not. By the functional central limit theorem we 
mean here that the entire estimate of the non-parametric function is uniformly close to a 
Gaussian process that can change with n. That is, we derive the strong approximation for 
the entire estimate of the non-parametric function. 

Another set of results established here pertains to the estimation and inference methods
for linear functionals of the conditional mean function $g : \mathcal{X} \to \mathbb{R}$. Examples of linear
functionals $\theta$ of interest include, for $x = (w, v) \in \mathcal{X}$ and $x_j$ denoting the $j$-th component of
$x$,

1. the partial derivative: $\theta(x) = \partial_{x_j} g(x)$;



LEAST SQUARES SERIES: POINTWISE AND UNIFORM RESULTS 3 

2. the average partial derivative: $\theta = \int \partial_{x_j} g(x)\, d\mu(x)$;

3. the conditional average partial derivative: $\theta(w) = \int \partial_{x_j} g(w, v)\, d\mu(v|w)$,

where the measures $\mu$ entering the definitions above are taken as known; the result can be
extended to include estimated measures. Under weak conditions we derive pointwise results 
for rates of convergence, large sample distributions and inference methods based on the 
Gaussian approximation. Under stronger conditions we derive new strong approximation 
for the entire estimate of the non-parametric function. Specifically, we derive uniform 
results for rates of convergence, large sample distributions and inference methods based on 
the Gaussian approximation. 

Notation. In what follows, all parameter values are indexed by the sample size $n$, but
we omit the index whenever this does not cause confusion. We use the notation $(a)_+ =
\max\{a, 0\}$, $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. The $\ell_2$-norm of a vector $v$ is denoted
by $\|v\|$, while for a matrix $Q$ the maximum eigenvalue is denoted by $\|Q\|$. We also use
standard notation in the empirical process literature,

$$\mathbb{E}_n[f] = \mathbb{E}_n[f(w_i)] = \sum_{i=1}^{n} f(w_i)/n,$$

and we use the notation $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not
depend on $n$; and $a \lesssim_P b$ to denote $a = O_P(b)$. Moreover, for two random variables $X, Y$
we say that $X =_d Y$ if they have the same probability distribution.

2. Set-Up 

Consider a sequence of models, indexed by the sample size $n$,

$$y_i = g(x_i) + \epsilon_i, \quad E[\epsilon_i \mid x_i] = 0, \quad x_i \in \mathcal{X} \subseteq \mathbb{R}^d, \quad i = 1, \dots, n, \tag{2.1}$$

where $y_i$ is the response variable and $x \mapsto g(x) = E[y_i \mid x_i = x] \in \mathcal{G}_n$, a class of functions. We
assume that $\mathcal{X}$ is a compact set in $\mathbb{R}^d$ and that all conditions stated below hold uniformly
in $n$. For notational convenience we omit indexing by $n$ in what follows.

Assumption A.1. Random vectors $(\epsilon_i, x_i')'$, $i = 1, \dots, n$, are i.i.d. and $\sup_{i \le n} \sigma_i^2$, where
$\sigma_i^2 = E[\epsilon_i^2 \mid x_i]$, is bounded.

We approximate the function $x \mapsto g(x)$ by linear forms $x \mapsto p(x)'b$, where

$$x \mapsto p(x) = (p_j(x), j = 1, \dots, k)$$

is a vector of approximating functions that can change with $n$. We denote the regressors as

$$p_i = p(x_i) = (p_j(x_i), j = 1, \dots, k),$$

where we use $i$ to index observations $p(x_i)$, and $j$ to index components of $p(x_i)$. The next
assumption imposes regularity conditions on the regressors.

Assumption A.2. Eigenvalues of $Q := E[p_i p_i']$ are bounded above and away from zero.
Also we let

$$\xi_k := \sup_{x \in \mathcal{X}} \|Q^{-1/2} p(x)\|,$$

and assume that $k$ is chosen so that

$$\xi_k^2 \log n / n \to 0. \tag{2.2}$$

Normalization. To simplify notation, we normalize Q = I, but we shall treat Q as 
unknown, that is we deal with random design. 

The relation (2.2) restricts how fast $k$ can grow with $n$, but it is a mild condition for many
interesting bases of functions, as discussed below. Condition A.2 also imposes that the $p_j$'s are
not too collinear. The following proposition establishes a simple sufficient condition for A.2
based on orthonormal bases with respect to a different measure.

Proposition 1 (Stability of Bounds on Singular Values). Let $x \sim F$ and the regressors
$p_i = p(x_i)$, with $x \mapsto p(x)$ orthonormal on $(\mathcal{X}, \mu)$ for some measure $\mu$. Then A.2 is satisfied
if $dF/d\mu$ is bounded above and away from zero.
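
The content of Proposition 1 can be checked numerically. The sketch below is an illustration only: it assumes $\mathcal{X} = [0, 1]$, takes $\mu$ to be the uniform measure, uses Legendre polynomials rescaled to be orthonormal under $\mu$, and picks a hypothetical density $dF/d\mu$ bounded between 0.5 and 1.5. The function names are ours, not the paper's. Under these assumptions every eigenvalue of $Q$ should lie between the lower and upper bounds of the density.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_basis(x, k):
    """First k Legendre polynomials, orthonormal w.r.t. the uniform measure on [0, 1]."""
    t = 2.0 * np.asarray(x, dtype=float) - 1.0           # map [0, 1] to [-1, 1]
    cols = [np.sqrt(2 * j + 1) * legendre.legval(t, np.eye(k)[j]) for j in range(k)]
    return np.column_stack(cols)

def population_Q(density, k, n_nodes=200):
    """Q = integral of p(x) p(x)' f(x) dx over [0, 1], by Gauss-Legendre quadrature."""
    nodes, weights = legendre.leggauss(n_nodes)           # nodes and weights on [-1, 1]
    x = 0.5 * (nodes + 1.0)                               # rescale nodes to [0, 1]
    w = 0.5 * weights
    P = legendre_basis(x, k)
    return (P * (w * density(x))[:, None]).T @ P

# Hypothetical density dF/dmu on [0, 1]: integrates to one, bounded in [0.5, 1.5].
density = lambda x: 1.0 + 0.5 * np.cos(2 * np.pi * x)

Q = population_Q(density, k=8)
eigs = np.linalg.eigvalsh(Q)
print("smallest and largest eigenvalue of Q:", eigs.min(), eigs.max())
```

With the uniform density the same routine returns (numerically) the identity matrix, which is the normalization $Q = I$ used in the text.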

It is well known that the least squares parameter $\beta$ solves

$$\beta = \arg\min_{b \in \mathbb{R}^k} E[(y_i - p(x_i)'b)^2],$$

which by (2.1) implies that $\beta$ also solves

$$\beta = \arg\min_{b \in \mathbb{R}^k} E[(g(x_i) - p(x_i)'b)^2].$$

We call $x \mapsto g(x)$ the target function and $x \mapsto p(x)'\beta$ the surrogate function. In
this setting, the surrogate function provides the best linear approximation to the target
function.

Accordingly we have a many regressors model

$$y_i = p_i'\beta + u_i, \quad E[u_i p_i] = 0, \quad u_i := r_i + \epsilon_i,$$

where

$$r_i := r(x_i) = g(x_i) - p(x_i)'\beta$$

is the approximation error. The least squares estimator of $\beta$ is

$$\hat\beta = \arg\min_{b \in \mathbb{R}^k} \mathbb{E}_n[(y_i - p_i'b)^2],$$

which induces the estimator $\hat g(x) := p(x)'\hat\beta$ for the target function $g(x)$. Thus, it follows
that we can decompose the error in estimating the target function as

$$\hat g(x) - g(x) = p(x)'(\hat\beta - \beta) - r(x),$$

where the first term on the right-hand side is the estimation error and the second term is
the approximation error, which enters with a negative sign because $r(x) = g(x) - p(x)'\beta$.
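
To make the estimator and the error decomposition concrete, here is a minimal simulation sketch. Everything specific in it is an assumption made for illustration: the target function, the noise level, the power-series basis, and the use of a fine grid to approximate the population projection coefficient $\beta$ under a uniform design.

```python
import numpy as np

def power_basis(x, k):
    """Polynomial series p(x) = (1, x, ..., x^(k-1)) for scalar x in [0, 1]."""
    return np.vander(np.asarray(x, dtype=float), N=k, increasing=True)

def series_fit(x, y, k):
    """Least squares coefficients: argmin_b E_n[(y_i - p_i'b)^2]."""
    beta_hat, *_ = np.linalg.lstsq(power_basis(x, k), y, rcond=None)
    return beta_hat

rng = np.random.default_rng(0)
n, k = 2000, 6
g = lambda x: np.sin(2 * np.pi * x)              # hypothetical target function
x = rng.uniform(0.0, 1.0, size=n)
y = g(x) + 0.3 * rng.standard_normal(n)

beta_hat = series_fit(x, y, k)

# Surrogate coefficients beta: projection of g on the basis under the uniform
# design, approximated here by least squares on a fine equally spaced grid.
grid = np.linspace(0.0, 1.0, 5001)
beta, *_ = np.linalg.lstsq(power_basis(grid, k), g(grid), rcond=None)

x0 = np.array([0.25, 0.5, 0.75])
p0 = power_basis(x0, k)
estimation_error = p0 @ (beta_hat - beta)        # p(x)'(beta_hat - beta)
approximation_error = g(x0) - p0 @ beta          # r(x) = g(x) - p(x)'beta
total_error = p0 @ beta_hat - g(x0)              # g_hat(x) - g(x)
print(total_error, estimation_error, approximation_error)
```

By construction the total error equals the estimation error minus $r(x)$ at every evaluation point, so increasing $k$ trades a smaller approximation error against a noisier estimate of $\beta$.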

We are also interested in various quantities $\theta$ created as linear functionals of the
conditional mean function. As discussed in the introduction, examples include the partial
derivative function, the average partial derivative function, and the conditional average
partial derivative. Importantly, in each example above we could be interested in estimating
$\theta(w)$ simultaneously for many values of $w \in \mathcal{W}$. We let $\mathcal{I} \subseteq \mathcal{W}$ denote the set of
indices of interest. By the
linearity of the series approximations, the above parameters can be seen as linear functions
of the least squares coefficients $\beta$ up to an approximation error, that is

$$\theta(w) = \ell(w)'\beta + r_n(w), \quad w \in \mathcal{I}, \tag{2.3}$$

where $\ell(w)'\beta$ is the series approximation, with $\ell(w)$ denoting the $k$-vector of loadings on the
coefficients, and $r_n(w)$ is the remainder term, which corresponds to the approximation error.
Indeed, the decomposition (2.3) arises from the application of different linear operators $\mathcal{A}$
to the decomposition $g(\cdot) = p(\cdot)'\beta + r(\cdot)$ and evaluating the resulting functions at $w$:

$$(\mathcal{A} g(\cdot))[w] = (\mathcal{A} p(\cdot))[w]'\beta + (\mathcal{A} r(\cdot))[w]. \tag{2.4}$$

Examples of the operator $\mathcal{A}$ corresponding to the cases enumerated in the introduction are
given by, respectively,

1. a differential operator: $(\mathcal{A}f)[x] = (\partial_{x_j} f)[x]$, so that

$$\ell(x) = \partial_{x_j} p(x), \quad r_n(x) = \partial_{x_j} r(x);$$

2. an integro-differential operator: $\mathcal{A}f = \int \partial_{x_j} f(x)\, d\mu(x)$, so that

$$\ell = \int \partial_{x_j} p(x)\, d\mu(x), \quad r_n = \int \partial_{x_j} r(x)\, d\mu(x);$$

3. a partial integro-differential operator: $(\mathcal{A}f)[w] = \int \partial_{x_j} f(x)\, d\mu(v|w)$, so that

$$\ell(w) = \int \partial_{x_j} p(x)\, d\mu(v|w), \quad r_n(w) = \int \partial_{x_j} r(x)\, d\mu(v|w).$$

For notational convenience, we use the formulation (2.3) in the analysis, instead of the
motivational formulation (2.4).



We shall provide the inference tools that will be valid for inference on the series approx- 
imation 

$$\ell(w)'\beta, \quad w \in \mathcal{I},$$

and, provided that the approximation error $r_n(w)$, $w \in \mathcal{I}$, is small enough as compared to
the estimation noise, these tools will also be valid for inference on the functional of interest:

$$\theta(w), \quad w \in \mathcal{I}.$$

In this case, the series approximation is an important intermediary target, whereas the
functional $\theta$ is the ultimate target. The inference will be based on the plug-in estimator
$\hat\theta(w) := \ell(w)'\hat\beta$ of the series approximation $\ell(w)'\beta$ and hence of the final target $\theta(w)$.



3. Approximation Properties of Least Squares 

Next we consider approximation properties of the least squares estimator. Not
surprisingly, approximation properties must rely on the particular choice of approximating
functions. At this point it is instructive to consider particular examples of relevant bases
used in the literature.

Example 1 (Polynomial series). Consider $\mathcal{X} = [0, 1]$ and polynomial series given by

$$p(x) = (1, x, x^2, \dots, x^{k-1}).$$

In order to reduce collinearity problems, orthonormalize with respect to the uniform measure
to get the Legendre polynomials

$$p(x) = (1, x, 2^{-1}(3x^2 - 1), \dots)$$

with 

$$\xi_k \lesssim k.$$
Example 2 (Fourier series). Consider the domain $\mathcal{X} = [0, 1]$ and a Fourier series given by

$$p(x) = (1, \cos(2\pi j x), \sin(2\pi j x), j = 1, 2, \dots, k/2 - 1),$$

for $k$ even, which is orthonormal with respect to the Lebesgue measure, with

$$\xi_k \lesssim \sqrt{k}.$$
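
The growth of $\xi_k$ for the two bases above can be checked numerically. The sketch below is illustrative and rests on two assumptions of ours: the trigonometric terms carry a $\sqrt{2}$ factor and the Legendre polynomials a $\sqrt{2j+1}$ factor, so that both bases are exactly orthonormal under the uniform measure on $[0, 1]$ (hence $Q = I$), and the supremum over $x$ is replaced by a maximum over a grid that contains the endpoints.

```python
import numpy as np
from numpy.polynomial import legendre

def fourier_basis(x, J):
    """(1, sqrt(2)cos(2 pi j x), sqrt(2)sin(2 pi j x), j = 1..J), orthonormal on [0, 1]."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    for j in range(1, J + 1):
        cols.append(np.sqrt(2.0) * np.cos(2 * np.pi * j * x))
        cols.append(np.sqrt(2.0) * np.sin(2 * np.pi * j * x))
    return np.column_stack(cols)

def legendre_basis(x, k):
    """First k Legendre polynomials, orthonormal w.r.t. the uniform measure on [0, 1]."""
    t = 2.0 * np.asarray(x, dtype=float) - 1.0
    cols = [np.sqrt(2 * j + 1) * legendre.legval(t, np.eye(k)[j]) for j in range(k)]
    return np.column_stack(cols)

def xi(basis_matrix):
    """Largest Euclidean norm of p(x) over the rows (grid points)."""
    return np.linalg.norm(basis_matrix, axis=1).max()

grid = np.linspace(0.0, 1.0, 2001)               # includes the endpoints 0 and 1
for k in (5, 9, 17):
    J = (k - 1) // 2                             # 2J + 1 = k Fourier terms
    print(k, xi(fourier_basis(grid, J)), xi(legendre_basis(grid, k)))
```

Under these normalizations the Fourier value is $\sqrt{k}$ at every point, while the Legendre value is attained at the endpoints and grows linearly in $k$, which is why the polynomial basis requires a more restrictive growth condition on $k$.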

Example 3 (Splines). Let $\mathcal{X} = [-1, 1]$ and consider the linear spline series, or spline series
of order 1, with a finite number of equally spaced knots $k_1, k_2, \dots, k_r$ in $\mathcal{X}$:

$$p(x) = (1, x, (x - k_1)_+, \dots, (x - k_r)_+)'.$$

The cubic spline series takes the form:

$$p(x) = (1, x, x^2, x^3, (x - k_1)_+^3, \dots, (x - k_r)_+^3)'.$$

The function $x \mapsto p(x)'b$ constructed using cubic splines is twice differentiable in $x$ for any
$b$. Instead of pure splines, we often use B-splines, which are linear transformations of the
above functions with lower multicollinearity; moreover,

$$\xi_k \lesssim \sqrt{k}.$$

Example 4 (Cohen-Daubechies-Vial wavelet bases). Let $\mathcal{X} = [0, 1]$ and consider
Cohen-Daubechies-Vial (CDV) wavelet bases. See Section 4 in Cohen et al. (1993) and Chapter
7.5 in Mallat (2009) for details on CDV wavelet bases. CDV wavelet bases are a class of
orthonormal bases of $L^2[0, 1]$, which is the standard $L^2$ space for functions defined on $[0, 1]$.
Each such basis is built from a Daubechies scaling function $\phi$ (defined on $\mathbb{R}$) and the wavelet
$\psi$ of order $N$ starting from a fixed resolution level $J_0$ such that $2^{J_0} \ge 2N$. The $\phi$ and $\psi$ are
supported on $[0, 2N - 1]$ and $[-N + 1, N]$, respectively. Translate $\phi$ so that $\phi$ has support
$[-N + 1, N]$. Let

$$\phi_{l,m} = \phi(2^l \cdot - m), \quad \psi_{l,m} = \psi(2^l \cdot - m), \quad l, m \ge 0.$$

The $\phi_{J_0,m}, \psi_{l,m}$ that are supported in the interior of $[0, 1]$ are all kept ($\phi^{\mathrm{int}}_{J_0,m} = \phi_{J_0,m}$ for
$m = N, \dots, 2^{J_0} - N - 1$; $\psi^{\mathrm{int}}_{l,m} = \psi_{l,m}$ for $m = N, \dots, 2^l - N - 1$, $l \ge J_0$), and suitable
boundary corrected functions are added, so that $\{\phi^{\mathrm{int}}_{J_0,m}\}_{0 \le m < 2^{J_0}} \cup \{\psi^{\mathrm{int}}_{l,m}\}_{0 \le m < 2^l,\, l \ge J_0}$ forms an
orthonormal basis of $L^2[0, 1]$. Suppose that $k = 2^J$ for some $J > J_0$. Let

$$p(x) = (\phi^{\mathrm{int}}_{J_0,0}(x), \dots, \phi^{\mathrm{int}}_{J_0,2^{J_0}-1}(x), \psi^{\mathrm{int}}_{J_0,0}(x), \dots, \psi^{\mathrm{int}}_{J-1,2^{J-1}-1}(x))'.$$

Then

$$\xi_k \lesssim \sqrt{k}.$$

CDV wavelet bases are useful for approximating not necessarily periodic functions.
Example 5 (Tensor Products). Generalizations to multiple regressors are straightforward
using tensor products of unidimensional series. Suppose that the basic regressors are

$$x_i = (x_{1i}, \dots, x_{di})',$$

then we can create $d$ series, one for each basic regressor, then create all interactions of the $d$
series, called tensor products, and collect them into the regressor vector $p_i$. If each series for a
basic regressor has $J$ terms, then the final regressor has dimension

$$k \lesssim J^d,$$

which explodes exponentially in the dimension $d$. The bounds on $\xi_k$ in terms of $k$ remain
the same as in the one-dimensional case.

Each basis described in Examples 1-5 has different approximation properties which also
depend on the particular class of functions $\mathcal{G}_n$. The following captures the essence of this
dependence into two quantities.

Assumption A.3. For each $g \in \mathcal{G}_n$ and integer $k \ge 1$, there are finite constants $c_k$ and
$\ell_k$ such that

$$\|r\|_{F,2} := \sqrt{\int r^2(x)\, dF(x)} \le c_k \quad \text{and} \quad \|r\|_{F,\infty} := \sup_{x \in \mathcal{X}} |r(x)| \le \ell_k c_k,$$

where we call $\ell_k$ the generalized Lebesgue constant.

These quantities characterize the approximation properties of the underlying class of
functions under $L^2$ and uniform distances. Next we discuss primitive bounds on them.

3.1. Bounds on $c_k$. In what follows we call the case where $c_k \to 0$ the correctly specified
case. In particular, if for every $n$ large enough the series are formed from bases that span
$\mathcal{G}_n$, then $c_k \to 0$ as $k \to \infty$. However, if series are formed from bases that do not span $\mathcal{G}_n$,
then $c_k \to c_\infty$ as $k \to \infty$ where potentially $c_\infty > 0$. We call any case where $c_k \not\to 0$ the
incorrectly specified case.

Moreover, since

$$\inf_b \|g - p'b\|_{F,2} \le c_k \le \inf_b \|g - p'b\|_{F,\infty},$$

the approximation rates $c_k$ are readily available from rates computed in Approximation
Theory (see DeVore and Lorentz (1993)). For example, if $\mathcal{G}_n$ is $s$-smooth, namely a Hölder
class of smoothness order $s$, then

$$c_k \lesssim k^{-s/d}$$

for the examples of series given above, and

$$c_k \lesssim k^{-(s \wedge s_0)/d}$$

for splines of order $s_0$. However, we do not have to specify $\mathcal{G}_n$ in terms of smoothness.

3.2. Bounds on $\ell_k$. A least squares approximation by a particular series for the function
class $\mathcal{G}_n$ is called co-minimal if the generalized Lebesgue constant $\ell_k$ is small in the sense of
being a slowly varying function in $k$.

A valid (arguably crude) bound on $\ell_k$, which is independent of $\mathcal{G}_n$, is

$$\ell_k \le \xi_k + 1,$$

which is not small since $\xi_k \gtrsim \sqrt{k}$ for many interesting bases. Much sharper bounds follow
from Approximation Theory for some important cases. We list a few examples next.

Example 6 (Fourier series, continued). For Fourier series on $\mathcal{X} = [0, 1]$, $F = U(0, 1)$, and
$\mathcal{G}_n \subset C(\mathcal{X})$

$$\ell_k \le C_0 \log k + C_1,$$

where here and below $C_0$ and $C_1$ are some universal constants.

Example 7 (B-splines, continued). For B-splines of order $s$ on $\mathcal{X} = [0, 1]$, $F = U(0, 1)$,
and $\mathcal{G}_n \subset C(\mathcal{X})$

$$\ell_k \le C_0,$$

under approximately uniform placement of knots.

Example 8 (Chebyshev polynomials). For Chebyshev polynomials on $\mathcal{X} = [-1, 1]$,
$dF(x)/dx = 1/(\pi\sqrt{1 - x^2})$, and $\mathcal{G}_n \subset C(\mathcal{X})$

$$\ell_k \le C_0 \log k + C_1.$$

Example 9 (Local polynomials). For local polynomials of order $s$ on $\mathcal{X} = [0, 1]$,
$F = U(0, 1)$, and $\mathcal{G}_n \subseteq \mathcal{G}$, a Hölder class,

$$\ell_k \le C_0.$$

Example 10 (Tailored Function Classes). For each type of series approximations, it is 
possible to specify function classes for which the generalized Lebesgue constants are small. 
Since the Lebesgue constant depends on the particular basis and on the underlying
probability measure, it is important to have a stability result for the Lebesgue constant. The
next proposition provides a bound on $\ell_k c_k$ for most functions in the $a$-ellipsoid class

$$\Big\{ \sum_{j \ge 1} p_j(x)\, j^{-a} \zeta_j : \zeta_j \in \mathbb{R},\ j \ge 1 \Big\}$$

according to a Gaussian measure on the coefficients $\zeta_j$, $j \ge 1$, provided the basis functions
are bounded and Lipschitz.

Proposition 2 (Generic Stability of Approximation Error for $a$-Ellipsoid). Consider the
standard Gaussian measure on the coefficients $\zeta_j$, $j \ge 1$, let $f = \sum_{j \ge 1} p_j(x)\, j^{-a} \zeta_j$ and
let $\ell_k(f)$ and $c_k(f)$ denote respectively the generalized Lebesgue constant and the $L^2$
approximation rate associated with $f$. If the basis $\{p_j(x)\}_{j \ge 1}$ obeys $\sup_{x \in \mathcal{X},\, j \ge 1} |p_j(x)| \le 1$ and
$\sup_{x \in \mathcal{X}} \|\nabla p_j(x)\| \le M_j$, with $j^{-a} M_j \log^{1/2} j = o(1)$ as $j \to \infty$, then

P (UfH(f) < dV 2 y/(a-l/2)\ogkk- a + 1 / 2 ') = 1 - o(l). 

In the case of an orthogonal basis, most functions in this class will have $c_k = k^{-a+1/2}$.
Thus, Proposition 2 establishes that $\ell_k$ is slowly varying for those functions.

The following example illustrates the performance of the series estimator using different
bases for a real data set.

Example 11 (Real Data). Here $g(x)$ is the mean of log wage ($y$) conditional on education

$$x \in \{8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20\}.$$

The function $g(x)$ is computed using population data: the 1990 Census data for U.S.
men of prime age (see Angrist, Chernozhukov and Fernandez-Val (2006) for
more details). We would like to know how well this function is approximated when common
approximation methods are used to form the regressors. For simplicity we assume that $x_i$
is uniformly distributed (otherwise we can weigh by the frequency). In population, the least
squares estimator solves the approximation problem $\min_b E[\{g(x_i) - p_i'b\}^2]$ for $p_i = p(x_i)$,
where we form $p(x)$ as (a) linear spline (Figure 2, left) and (b) polynomial series (Figure 2,
right), such that the dimension of $p(x)$ is either $K = 3$ or $K = 8$.

Then we compare the function $g(x)$ to the linear approximation $p(x)'\beta$ graphically. We
also record the RMSAE as well as the maximum error MAE. The approximation errors are
given in the following table:
[Figure 1. Left panel: Approximation by Linear Splines, K = 3 and 8. Right panel:
Approximation by Polynomial Series, K = 3 and 8. Horizontal axis in both panels: ed,
from 8 to 20.]





                 spline K = 3   spline K = 8   Poly K = 3   Poly K = 8

L2 Error             0.12           0.08          0.12         0.05

L∞ Error             0.29           0.17          0.30         0.12



In this example, the Lebesgue constant of the polynomial approximations is comparable 
to the Lebesgue constant of the spline approximations. 
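
The computation behind such a table can be sketched as follows. The Census conditional means are not reproduced here, so the function `g` below is a hypothetical stand-in with jumps at 12 and 16 years of education; the equally spaced interior knots and the rescaling of the polynomial argument are likewise our assumptions. The numbers this sketch prints are therefore not those of the table above.

```python
import numpy as np

def linear_spline_basis(x, K):
    """(1, x, (x - t_1)_+, ..., (x - t_{K-2})_+) with equally spaced interior knots."""
    x = np.asarray(x, dtype=float)
    knots = np.linspace(x.min(), x.max(), K)[1:-1]     # K - 2 interior knots
    cols = [np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

def polynomial_basis(x, K):
    """(1, z, ..., z^(K-1)) with z the argument rescaled to [0, 1] for conditioning."""
    x = np.asarray(x, dtype=float)
    z = (x - x.min()) / (x.max() - x.min())
    return np.vander(z, N=K, increasing=True)

def approximation_errors(basis, x, g, K):
    """Root mean square and maximum absolute error of the best linear approximation."""
    P = basis(x, K)
    beta, *_ = np.linalg.lstsq(P, g, rcond=None)       # uniform weights on support points
    r = g - P @ beta
    return np.sqrt(np.mean(r ** 2)), np.max(np.abs(r))

x = np.array([8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20], dtype=float)
# Hypothetical conditional mean of log wage; NOT the Census values.
g = 5.5 + 0.08 * (x - 8) + 0.15 * (x >= 12) + 0.25 * (x >= 16)

results = {
    (name, K): approximation_errors(basis, x, g, K)
    for name, basis in [("spline", linear_spline_basis), ("poly", polynomial_basis)]
    for K in (3, 8)
}
for key, (rmsae, mae) in results.items():
    print(key, round(rmsae, 3), round(mae, 3))
```

The ratio of the maximum error to the root mean square error in each column is the empirical counterpart of the generalized Lebesgue constant for that basis and function.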

4. Limit Theory 

4.1. $L^2$ rate of convergence. Having established the set-up, we proceed to derive
our results. We start with the $L^2$ rate of convergence.

Theorem 1 ($L^2$ rate of convergence). Assume that conditions A.1-A.3 hold. Then, under
$c_k \to 0$ we have

$$\|\hat g - g\|_{F,2} \lesssim_P \sqrt{k/n} + c_k$$
and under $c_k \to c_\infty > 0$ we have



$$\|\hat g - g\|_{F,2} - c_\infty \lesssim_P \sqrt{k/n} + c_k - c_\infty + \sqrt{\frac{k \log n}{n}}.$$



For most series with $\xi_k \lesssim \sqrt{k}$ the condition $\xi_k^2 \log n = o(n)$ amounts to $k \log n = o(n)$.

This result weakens the rate requirements obtained in Newey (1997) and Huang (2003a) with
unknown design and is as sharp as the result of Stone (1994) obtained for splines only.
Under correct specification, the fastest rate is achieved by setting the approximation error
and the sampling error to be of the same order,

$$\sqrt{k/n} \asymp c_k.$$

One consequence of this result is that for the common $a$-smooth classes the series estimators
achieve the optimal rate of convergence in the $L^2$ metric under very weak assumptions.
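
As a worked instance of this balancing, suppose (as an assumption for illustration) that the bound of Section 3.1 for an $s$-smooth class is sharp, so that $c_k \asymp k^{-s/d}$. Equating the two terms gives the familiar choice of $k$ and the resulting rate:

```latex
\sqrt{k/n} \asymp k^{-s/d}
\;\Longleftrightarrow\;
k^{1/2 + s/d} \asymp n^{1/2}
\;\Longleftrightarrow\;
k \asymp n^{d/(2s+d)},
\qquad\text{so that}\qquad
\sqrt{k/n} + c_k \asymp n^{-s/(2s+d)}.
```

With this choice $k/n \asymp n^{-2s/(2s+d)}$, so $k \log n = o(n)$ holds for every $s > 0$.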

4.2. Pointwise Limit Theory. Next we focus on pointwise limit theorems. That is, for
a fixed sequence $\alpha$ with $\|\alpha\| = 1$, we will show that pointwise results can be achieved under
weak conditions similar to the ones we required to achieve the rates of convergence in
Theorem 1.

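Lemma 1 (Pointwise Linearization). Suppose A.1-A.3 hold. We have that for any $\alpha \in S^{k-1}$

$$\sqrt{n}\,\alpha'(\hat\beta - \beta) = \alpha'\mathbb{G}_n[p_i(\epsilon_i + r_i)] + R_{1n}, \tag{4.5}$$

where the term $R_{1n}$, summarizing the impact of unknown design, obeys

$$R_{1n} \lesssim_P \sqrt{\frac{\xi_k^2 \log n}{n}}\,\big(1 + \sqrt{k}\,\ell_k c_k\big). \tag{4.6}$$

Moreover,

$$\sqrt{n}\,\alpha'(\hat\beta - \beta) = \alpha'\mathbb{G}_n[p_i \epsilon_i] + R_{1n} + R_{2n}, \tag{4.7}$$

where the term $R_{2n}$, summarizing the impact of approximation error on the sampling error
of the estimator, obeys

$$R_{2n} \lesssim_P \ell_k c_k. \tag{4.8}$$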

We obtain this linearization and subsequent pointwise normality results under
considerably weaker conditions on the growth of $k$ than those published in the literature, which
typically impose that

$$k\xi_k^2/n \to 0,$$

whereas here

$$\xi_k^2 \log n/n \to 0$$

is made possible in many cases. Also, as a special case, we recover the extremely sharp
results of Stone and Huang for splines, who do not impose $k\xi_k^2/n \to 0$ under the condition
that the maximal approximation error $c_k \ell_k$ vanishes at the $\sqrt{k \log n}$ rate, albeit here we generally
do not require $\sqrt{k \log n}\,\ell_k c_k \to 0$, so our results for this special case are slightly more general.
However, as in Stone and Huang, our conditions on the growth of $k$ are the weakest when
the maximal approximation error $c_k \ell_k$ vanishes at the $\sqrt{k \log n}$ rate. In summary, the only
condition that generally matters for linearization (4.5) is that $R_{1n} \to 0$. In particular, our
results in (4.5)-(4.6) allow for misspecification, albeit in this case the requirement $R_{1n} \to 0$
limits the growth of $k$. Moreover, we conjecture that the bound on $R_{1n}$ can be improved
for splines to

$$R_{1n} \lesssim_P \sqrt{\frac{\xi_k^2 \log n}{n}}\,\big(1 + \sqrt{\log n} \cdot \ell_k c_k\big),$$

since it is attained by local polynomials and splines are also similarly localized.

In order to establish normality we require the following technical conditions. 

Assumption A.4. The disturbance $\epsilon_i$ is conditionally uniformly integrable, namely as
$M \to \infty$,

$$\sup_{x \in \mathcal{X}} E[\epsilon_i^2 1\{|\epsilon_i| > M\} \mid x_i = x] \to 0,$$

and the maximal approximation error is not too large:

$$\sup_{x \in \mathcal{X}} |r(x)| \le \ell_k c_k = o(\sqrt{n}/\xi_k).$$



Theorem 2 (Pointwise Normality). Consider our least squares problem or, more generally,
any problem where the estimator of $g(x) = p(x)'\beta + r(x)$ takes the form $p(x)'\hat\beta$, where $\hat\beta$
admits linearization of the form (4.5)-(4.8). Suppose A.1-A.4 hold.

We have that if $R_{1n} \to_P 0$, for any deterministic sequence $\{\alpha\}$ with $\|\alpha\| = 1$

$$\sqrt{n}\,\frac{\alpha'(\hat\beta - \beta)}{\|\alpha'\Omega^{1/2}\|} =_d N(0, 1) + o_P(1),$$

where under $R_{2n} \not\to_P 0$, we set $\Omega := Q^{-1}E[(\epsilon_i + r_i)^2 p_i p_i']Q^{-1}$, and under $R_{2n} \to_P 0$ we
can set $\Omega = \Omega_0 := Q^{-1}E[\epsilon_i^2 p_i p_i']Q^{-1}$. Moreover, for any deterministic sequence $x \in \mathcal{X}$ and
$s(x) := \Omega^{1/2} p(x)$

$$\sqrt{n}\,\frac{p(x)'(\hat\beta - \beta)}{\|s(x)\|} =_d N(0, 1) + o_P(1),$$

and if the approximation error is negligible relative to the standard error, namely
$\sqrt{n}\,r(x) = o(\|s(x)\|)$,

$$\sqrt{n}\,\frac{\hat g(x) - g(x)}{\|s(x)\|} =_d N(0, 1) + o_P(1).$$

The result delivers pointwise convergence in distribution uniformly in $x \in \mathcal{X}$ since $\mathcal{X}$ is
compact and we allowed for any deterministic sequence within $\mathcal{X}$. The comments given
after the linearization result in Lemma 1 apply here as well. Note that the normalization
factor $\|s(x)\|$ is the pointwise standard error, and it is of a typical order $\|s(x)\| \propto \sqrt{k}$ at
most points. (For splines and trigonometric series this holds uniformly across all points.)
In this case the condition for negligibility of approximation error $\sqrt{n}\,r(x)/\|s(x)\| \to 0$ can
be replaced by

$$\sqrt{n/k} \cdot \ell_k c_k \to 0.$$

4.3. Uniform Limit Theory. Finally we turn to a uniform limit theory. Not surprisingly,
stronger conditions are required for our results to hold when compared to the pointwise
case. Here we need the following assumption on the tails of the regression errors and on the
basis.

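Assumption A.5. For some $m > 2$,

$$\sup_{x \in \mathcal{X}} E[|\epsilon_i|^m \mid x_i = x] < \infty \quad \text{and} \quad \frac{\xi_k^{2m/(m-2)} \log n}{n} \lesssim 1.$$

Letting $\alpha(x) := p(x)/\|p(x)\|$, there is a constant $a < \infty$ such that for all $x, x' \in \mathcal{X}$

$$\|\alpha(x) - \alpha(x')\| \le L_{1k}\|x - x'\|, \quad L_{1k} \lesssim k^a.$$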



Lemma 2 (Uniform Linearization). Suppose that Assumptions A.1-A.5 are satisfied. Then,
uniformly in $x \in \mathcal{X}$

$$\sqrt{n}\,\alpha(x)'(\hat\beta - \beta) = \alpha(x)'\mathbb{G}_n[p_i(\epsilon_i + r_i)] + R_{1n}, \tag{4.9}$$

where $R_{1n}$, summarizing the impact of unknown design, obeys
$$R_{1n} \lesssim_P \sqrt{\frac{\xi_k^2 \log n}{n}}\,\big(n^{1/m}\sqrt{\log n} + \sqrt{k}\,\ell_k c_k\big). \tag{4.10}$$
Moreover,

$$\sqrt{n}\,\alpha(x)'(\hat\beta - \beta) = \alpha(x)'\mathbb{G}_n[p_i \epsilon_i] + R_{1n} + R_{2n}, \tag{4.11}$$

where $R_{2n}$, summarizing the impact of approximation error on the sampling error of the
estimator, obeys

$$R_{2n} \lesssim_P \sqrt{\log n} \cdot \ell_k c_k. \tag{4.12}$$

We obtain this linearization under weak conditions (in fact, it is not clear whether
analogous results have been obtained before), also allowing for non-vanishing approximation error.

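Theorem 3 (Uniform Rate). Consider our least squares problem or, more generally, any
problem where the estimator of $g(x) = p(x)'\beta + r(x)$ takes the form $p(x)'\hat\beta$, where $\hat\beta$ admits
uniform linearization of the form (4.9)-(4.12).

Under Assumptions A.1-A.5, we have that

$$\sup_{x \in \mathcal{X}} |\alpha(x)'\mathbb{G}_n[p_i \epsilon_i]| \lesssim_P \sqrt{\log n}.$$

Moreover, for $R_{1n}$ and $R_{2n}$ given above we have

$$\sup_{x \in \mathcal{X}} |p(x)'(\hat\beta - \beta)| \lesssim_P \frac{\xi_k}{\sqrt{n}}\big(\sqrt{\log n} + R_{1n} + R_{2n}\big)$$

and

$$\sup_{x \in \mathcal{X}} |\hat g(x) - g(x)| \lesssim_P \frac{\xi_k}{\sqrt{n}}\big(\sqrt{\log n} + R_{1n} + R_{2n}\big) + \ell_k c_k.$$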



The resulting rates are close to the optimal rate within logs if the Lebesgue constant $\ell_k$
behaves like $\log n$, which is reasonable in a number of examples, and if $R_{1n} + R_{2n} \lesssim \log^c n$,
which is possible in many though not all cases. Again, conditions here improve the rates
obtained in the previous work of Newey (1997). Relative to pointwise or $L^2$ results, we
get only an extra $\log n$ factor in the rate. Note, however, that the assumptions on the
error term are much stronger than in the pointwise case. If the errors have heavier tails,
then the uniform rates can be much slower. In such cases, if one is simply interested in
estimates of some location function, then one could use the median regression estimator, which
will achieve faster uniform convergence rates, since the "errors" in the linearized version of
this estimator are just Bernoulli and therefore are sub-Gaussian.



The following result is an extension of the result obtained by Chernozhukov et al. (2009);
unlike their result, this result allows for a non-vanishing specification error. In particular,
we make a distinction between $\Omega := Q^{-1}E[(\epsilon_i + r_i)^2 p_i p_i']Q^{-1}$ and $\Omega_0 := Q^{-1}E[\epsilon_i^2 p_i p_i']Q^{-1}$,
which are potentially asymptotically different if $R_{2n} \not\to_P 0$.
Theorem 4 (Strong Approximation by a Gaussian Process). Consider our least squares
problem or, more generally, any problem where the estimator of $g(x) = p(x)'\beta + r(x)$ takes
the form $p(x)'\hat\beta$, where $\hat\beta$ admits uniform linearization of the form (4.9)-(4.12). Suppose
that A.1-A.3 hold, A.5 holds with $m > 3$, and that $R_{1n} = o_P(a_n^{-1})$, where for purposes of an
application later we need $a_n = \log n$, and that

$$a_n^6\, k^4\, \xi_k^2\, (1 + \ell_k^3 c_k^3)^2 \log^2 n / n \to 0.$$

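Then for some $\mathcal{N}_k \sim N(0, I_k)$,

$$\sqrt{n}\,\frac{\alpha'(\hat\beta - \beta)}{\|\alpha'\Omega^{1/2}\|} =_d \frac{\alpha'\Omega^{1/2}}{\|\alpha'\Omega^{1/2}\|}\,\mathcal{N}_k + o_P(a_n^{-1}) \quad \text{in } \ell^\infty(S^{k-1})$$

as stochastic processes indexed by $\alpha \in S^{k-1}$, so that for $s(x) = [p(x)'\Omega^{1/2}]'$

$$\sqrt{n}\,\frac{p(x)'(\hat\beta - \beta)}{\|s(x)\|} =_d \frac{s(x)'}{\|s(x)\|}\,\mathcal{N}_k + o_P(a_n^{-1}) \quad \text{in } \ell^\infty(\mathcal{X}),$$

and if $\sup_{x \in \mathcal{X}} \sqrt{n}\,|r(x)|/\|s(x)\| = o(a_n^{-1})$,

$$\sqrt{n}\,\frac{\hat g(x) - g(x)}{\|s(x)\|} =_d \frac{s(x)'}{\|s(x)\|}\,\mathcal{N}_k + o_P(a_n^{-1}) \quad \text{in } \ell^\infty(\mathcal{X}).$$

Under $R_{1n} = o_P(a_n^{-1})$, we take $\Omega$ as defined above, and under $R_{2n} = o_P(a_n^{-1})$ we can set $\Omega = \Omega_0$.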

This result is much stronger than the pointwise normality result: it asserts that the entire 
studentized nonparametric estimation process is uniformly close to a Gaussian process of 
the stated form. 

Another related inference method is the weighted bootstrap. Consider a set of weights
$h_1, \dots, h_n$ that are i.i.d. draws from the standard exponential distribution. For each draw
of such weights, define the weighted bootstrap draw of the least squares estimator as a
solution to the least squares problem weighted by $h_1, \dots, h_n$, namely

$$\hat\beta^b \in \arg\min_{b \in \mathbb{R}^k} \mathbb{E}_n[h_i (y_i - p_i'b)^2].$$
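
The weighted bootstrap draw defined above can be sketched in a few lines. This is an illustration under assumptions of ours: a polynomial series, a hypothetical data-generating process, and a modest number of bootstrap draws `B`. The weighted problem is solved by ordinary least squares on data rescaled by $\sqrt{h_i}$, which has the same minimizer.

```python
import numpy as np

def weighted_ls(P, y, h):
    """argmin_b E_n[h_i (y_i - p_i'b)^2], via least squares on sqrt(h)-scaled data."""
    s = np.sqrt(h)
    coef, *_ = np.linalg.lstsq(P * s[:, None], y * s, rcond=None)
    return coef

rng = np.random.default_rng(1)
n, k, B = 500, 5, 200
x = rng.uniform(0.0, 1.0, size=n)
P = np.vander(x, N=k, increasing=True)                # polynomial series, for illustration
y = np.exp(x) + 0.5 * rng.standard_normal(n)          # hypothetical data-generating process

beta_hat = weighted_ls(P, y, np.ones(n))              # unit weights: plain least squares

draws = np.empty((B, k))
for b in range(B):
    h = rng.exponential(scale=1.0, size=n)            # i.i.d. standard exponential weights
    draws[b] = weighted_ls(P, y, h)

# Bootstrap spread of the fitted value at x0 = 0.5, centered at the original estimate.
p0 = np.vander(np.array([0.5]), N=k, increasing=True)[0]
boot_se = np.std(draws @ p0 - p0 @ beta_hat)
print("bootstrap standard error of g_hat(0.5):", boot_se)
```

Because the weights are strictly positive with probability one, every bootstrap draw uses all observations, which avoids the singular designs that resampling with replacement can produce when $k$ is large.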



The following theorem establishes that the weighted bootstrap distribution is valid for 
approximating the distribution of the least squares estimator. 

Theorem 5 (Weighted Bootstrap Method). (1) Suppose that A.1-A.5 hold and
$(n^{2/m} \log n + k \ell_k^2 c_k^2)\,\xi_k^2 \log^6 n = o(n)$. Then the weighted bootstrap process satisfies

$$\sqrt{n}\,\alpha(x)'(\hat\beta^b - \hat\beta) = \alpha(x)'\mathbb{G}_n[(h_i - 1) p_i(\epsilon_i + r_i)] + R_{1n},$$
where

$$\sup_{x \in \mathcal{X}} |R_{1n}| \lesssim_P \sqrt{\frac{\xi_k^2 \log^4 n}{n}}\,\big(n^{1/m}\sqrt{\log n} + \sqrt{k}\,\ell_k c_k\big) = o(1/\log n).$$

The bound continues to hold in $P$-probability if we replace the unconditional probability $P$
by the conditional probability $P^*(\cdot \mid X)$.

(2) Furthermore, under the conditions of Theorem 4 the weighted bootstrap process
indexed by $\alpha \in S^{k-1}$ approximates some Gaussian process $\mathcal{N}_k \sim N(0, I_k)$ defined in Theorem 4,
that is:

$$\|\Omega^{-1/2}\sqrt{n}(\hat\beta^b - \hat\beta) - \mathcal{N}_k\| = o_P(1/\log n).$$
We close this section by establishing sufficient conditions for the consistent estimation of
$\Omega$.

Theorem 6 (Matrices Estimation). Let $Q = E[p_i p_i']$ and $\Sigma = E[(\epsilon_i + r_i)^2 p_i p_i']$. Assume
that v\ = E[m.ax.i<i< n |ej| 2 ] is such that (1 + £\c\k logn + v log 2 n = op(n). Under 
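A.1-A.2, for $\hat Q = \mathbb{E}_n[p_i p_i']$ and $\hat\Sigma = \mathbb{E}_n[\hat\epsilon_i^2 p_i p_i']$, where $\hat\epsilon_i = y_i - p_i'\hat\beta$, we have

$$\|\hat Q - Q\| \lesssim_P \sqrt{\frac{\xi_k^2 \log n}{n}}\,\|Q\| = o(1) \quad \text{and} \quad \|\hat\Sigma - \Sigma\| \lesssim_P (\|Q\| \vee \|\Sigma\|)(v_n + \ell_k c_k)\sqrt{\frac{\xi_k^2 \log n}{n}}.$$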

Moreover, under these conditions, if for some sequence $a_n \to \infty$, $\|\hat Q - Q\| = o_P(1/a_n)$ and
$\|\hat\Sigma - \Sigma\| = o_P(1/a_n)$, we have $\|\hat\Omega - \Omega\| = o_P(1/a_n)$, where $\hat\Omega = \hat Q^{-1}\hat\Sigma\hat Q^{-1}$.

In the case of a bounded basis, Theorem 6 allows for consistent estimation of the matrix
$Q$ under the mild condition $k \log n = o(n)$. Not surprisingly, the estimation of $\Sigma$ depends on
the tail behavior of the error term. We note that under condition A.5, we have $v_n^2 \lesssim n^{2/m}$.
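
The plug-in matrices of Theorem 6 are straightforward to compute. The sketch below is a minimal illustration under assumptions of ours: a cosine basis that is orthonormal under the uniform design (so that $Q = I$ in population), a hypothetical heteroskedastic data-generating process, and the helper names `cosine_basis` and `sandwich`.

```python
import numpy as np

def cosine_basis(x, k):
    """(1, sqrt(2)cos(pi j x), j = 1..k-1), orthonormal w.r.t. the uniform measure on [0, 1]."""
    x = np.asarray(x, dtype=float)
    j = np.arange(1, k)
    return np.column_stack([np.ones_like(x), np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))])

def sandwich(P, y):
    """Return beta_hat, Q_hat, Sigma_hat and Omega_hat = Q_hat^{-1} Sigma_hat Q_hat^{-1}."""
    n = P.shape[0]
    beta_hat, *_ = np.linalg.lstsq(P, y, rcond=None)
    resid = y - P @ beta_hat                           # eps_hat_i = y_i - p_i' beta_hat
    Q_hat = P.T @ P / n                                # E_n[p_i p_i']
    Sigma_hat = (P * (resid ** 2)[:, None]).T @ P / n  # E_n[eps_hat_i^2 p_i p_i']
    Q_inv = np.linalg.inv(Q_hat)
    return beta_hat, Q_hat, Sigma_hat, Q_inv @ Sigma_hat @ Q_inv

rng = np.random.default_rng(2)
n, k = 1000, 6
x = rng.uniform(0.0, 1.0, size=n)
# Hypothetical model with heteroskedastic errors.
y = np.log1p(x) + (0.2 + 0.3 * x) * rng.standard_normal(n)

P = cosine_basis(x, k)
beta_hat, Q_hat, Sigma_hat, Omega_hat = sandwich(P, y)

# Pointwise standard error of g_hat(x0): ||s(x0)|| / sqrt(n) with s(x) = Omega^{1/2} p(x).
p0 = cosine_basis(np.array([0.5]), k)[0]
std_err = np.sqrt(p0 @ Omega_hat @ p0 / n)
print("g_hat(0.5) =", p0 @ beta_hat, "+/-", 1.96 * std_err)
```

The residuals here include the approximation error, so $\hat\Sigma$ targets $E[(\epsilon_i + r_i)^2 p_i p_i']$ rather than $E[\epsilon_i^2 p_i p_i']$, in line with the definition of $\Sigma$ in the theorem.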

5. Rates and Inference on Linear Functionals 

In this section we derive rates and inference results for linear functionals $\theta(w)$, $w \in \mathcal{I}$, of
the conditional expectation function such as its derivative or average derivative. By the
linearity of the series approximations, the linear functionals can be seen as linear functions
of the least squares coefficients $\beta$ up to an approximation error, that is

$$\theta(w) = \ell(w)'\beta + r_n(w), \quad w \in \mathcal{I},$$

where $\ell(w)'\beta$ is the series approximation, with $\ell(w)$ denoting the $k$-vector of loadings on
the coefficients, and $r_n(w)$ is the remainder term, which corresponds to the approximation
error.
In order to perform inference, we construct estimators for $\sigma_n^2(w) = \ell(w)'\Omega\ell(w)/n$, the variance of the associated linear functionals, as

$$\hat\sigma_n^2(w) = \ell(w)'\hat\Omega\ell(w)/n. \tag{5.13}$$

By Theorem 6, under our conditions (5.13) is uniformly consistent for $\sigma_n^2(w)$, namely $\hat\sigma_n(w)/\sigma_n(w) = 1 + o_P(1)$ uniformly over $w \in \mathcal I$.

5.1. Pointwise Results for Linear Functionals. Next we state regularity conditions on 
the loadings and approximation errors associated with the linear functional 9. 

Condition P. 

P.1 The approximation error is small, namely $\sqrt{n}\,|r_n(w)|/\|\ell(w)\| = o(1)$.
P.2 The norm of the loading $\ell(w)$ satisfies: $\|\ell(w)\| \lesssim \zeta_\theta(k,w)$.

Theorem 7 (Pointwise Convergence Rate for Linear Functionals). Assume that the condi- 
tions of Theorem and Condition P hold, then 

$$|\hat\theta(w) - \theta(w)| \lesssim_P \frac{\zeta_\theta(k,w)}{\sqrt{n}}. \tag{5.14}$$

To perform inference, we consider the t-statistic:

$$t_n(w) = \frac{\hat\theta(w) - \theta(w)}{\hat\sigma_n(w)}.$$

Under Condition P, the approximation error is small, so that

$$t_n(w) = \frac{\ell(w)'(\hat\beta - \beta)}{\hat\sigma_n(w)} + o_P(1).$$

We can carry out standard inference based on this statistic because $t_n(w) \to_d N(0,1)$.

Theorem 8 (Pointwise Inference for Linear Functionals). Suppose that the conditions of 
Theorem^ Theorem® and Condition P hold. Then, 

$$t_n(w) \to_d N(0,1).$$
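
As a hedged illustration of pointwise inference, the sketch below computes $\hat\theta(w)$, $\hat\sigma_n(w)$ and a normal-approximation interval for two functionals: the function value, with loadings $\ell(w) = p(w)$, and the derivative, with loadings $\ell(w) = \partial p(w)$. The cosine basis, the simulated data and the helper names are assumptions for illustration only.

```python
import numpy as np

def cosine_basis(x, k):
    """Hypothetical basis on [0, 1]: 1, sqrt(2)cos(pi j x), j = 1..k-1."""
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, np.arange(k)))
    B[:, 0] = 1.0
    return B

def cosine_basis_deriv(x, k):
    """Derivative of each basis function with respect to x (zero for the constant)."""
    j = np.arange(k)
    return -np.sqrt(2.0) * np.pi * j * np.sin(np.pi * np.outer(x, j))

rng = np.random.default_rng(0)
n, k = 2000, 6
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=n)

P = cosine_basis(x, k)
Q_hat = P.T @ P / n
beta_hat = np.linalg.solve(Q_hat, P.T @ y / n)
eps_hat = y - P @ beta_hat
Sigma_hat = (P * eps_hat[:, None] ** 2).T @ P / n
Q_inv = np.linalg.inv(Q_hat)
Omega_hat = Q_inv @ Sigma_hat @ Q_inv

def pointwise_ci(ell, z=1.96):
    """theta_hat = ell' beta_hat, sigma_hat^2 = ell' Omega_hat ell / n, and theta_hat +/- z sigma_hat."""
    theta_hat = ell @ beta_hat
    sigma_hat = np.sqrt(ell @ Omega_hat @ ell / n)
    return theta_hat, sigma_hat, (theta_hat - z * sigma_hat, theta_hat + z * sigma_hat)

w = 0.3
value = pointwise_ci(cosine_basis(np.array([w]), k)[0])        # theta(w) = g(w)
deriv = pointwise_ci(cosine_basis_deriv(np.array([w]), k)[0])  # theta(w) = g'(w)
print(value, deriv)
```

The derivative functional has much larger loadings than the function value, which is why its standard error is larger for the same fitted coefficients.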

5.2. Uniform Results for Linear Functionals. In studying uniform rates and inference 
we use the sup norm over the indices of interest I C M. d , namely, for / : I \— > W 71 , define 
the norm 

$$\|f\|_{\mathcal I} := \sup_{w\in\mathcal I}|f(w)|.$$

We shall invoke the following assumptions to establish rates and uniform inference results 
over the region I. 
Condition U.

U.1 The approximation error is small, namely $\sqrt{n}\log n\,\sup_{w\in\mathcal I}|r_n(w)|/\|\ell(w)\| = o(1)$.

U.2 The loadings $\ell(w)$ are uniformly bounded and admit Lipschitz coefficients $\zeta^L_\theta(k,\mathcal I)$, that is,

$\|\ell\|_{\mathcal I} \lesssim \zeta_\theta(k,\mathcal I)$, $\|\ell(w) - \ell(w')\| \le \zeta^L_\theta(k,\mathcal I)\|w - w'\|$, and
log[diam(X) V£ e (k,I) V^ L (A;,X) V < log*. 
The value of $\zeta_\theta(k,\mathcal I)$ depends on the choice of basis for the series estimator and on the linear functional. Newey (1997) and Chen (2006) provide several examples. In the case of regression splines, after a possible renormalization so that $\mathcal X = [-1,1]^d$, it has been established that $\zeta_k \lesssim \sqrt{k}$ and $\sup_{x\in\mathcal X}\|\partial^m p(x)\| \lesssim k^{1/2+m}$ (Newey, 1997). With this basis we have for the function itself $\zeta_\theta(k,\mathcal I) \lesssim \sqrt{k}$ ($\theta(x) = g(x)$ and $\ell(x) = p(x)$); for the derivative $\zeta_\theta(k,\mathcal I) \lesssim k^{3/2}$ ($\theta(x) = \partial_{x_j} g(x)$ and $\ell(x) = \partial_{x_j} p(x)$); for the average derivative $\zeta_\theta(k) \lesssim 1$ ($\theta = \int \partial_{x_j} g(x)\,d\mu(x)$, $\mathrm{supp}(\mu) \subset \mathrm{int}\,\mathcal X$, $|\partial_{x_k}\mu(x)| \lesssim 1$, $\ell = \int \partial_{x_j} p(x)\mu(x)\,dx = -\int p(x)\partial_{x_j}\mu(x)\,dx$).

Theorem 9 (Uniform Convergence Rate for Linear Functionals) . Assume that the condi- 
tions of Lemma\^and Condition U hold, and d£g (k, T)£? log 2 n = o(n), then 

g 9 (fc,X)Vl 



sup w£l \9(w) - 9(w)\ < P M^i^I^. 



(5.15) 



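In this case we consider the t-statistic process:

$$\Big\{ t_n(w) = \frac{\hat\theta(w) - \theta(w)}{\hat\sigma_n(w)},\ w \in \mathcal I \Big\}.$$

Under our assumptions the approximation error is small, so that

$$t_n(w) = \frac{\ell(w)'(\hat\beta - \beta)}{\hat\sigma_n(w)} + o_P(1/\log n) \quad \text{in } \ell^\infty(\mathcal I).$$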

The main result on inference is that the t-statistic process can be strongly approximated 
by the following Gaussian coupling: 



IM'^/Wfc/y^ ... ^ ^ (5-16) 
a n (w) 



well 



The following theorem shows that these couplings approximate the distribution of the t- 
statistic process in large samples. 
Theorem 10 (Strong Approximation of Inferential Processes by Gaussian Coupling). Suppose that the conditions of Theorems^ Theorem® and Condition U hold. Then,

$$t_n(w) =_d t_n^*(w) + o_P(1/\log n) \quad \text{in } \ell^\infty(\mathcal I).$$

To construct uniform two-sided confidence bands for $\{\theta(w) : w \in \mathcal I\}$, we consider the maximal t-statistic

$$\|t_n\|_{\mathcal I} = \sup_{w\in\mathcal I}|t_n(w)|,$$

as well as the couplings to this statistic in the form:

$$\|t_n^*\|_{\mathcal I} = \sup_{w\in\mathcal I}|t_n^*(w)|.$$

Ideally, we would like to use quantiles of the first statistic as critical values, but we do not know them. We instead use quantiles of the second statistic as large sample approximations. Let $k_n(1-\alpha)$ denote the $1-\alpha$ quantile of the random variable $\|t_n^*\|_{\mathcal I}$ conditional on the data $\mathcal D_n$, i.e.

$$k_n(1-\alpha) = \inf\{t : P(\|t_n^*\|_{\mathcal I} \le t \mid \mathcal D_n) \ge 1-\alpha\}.$$

This quantity can be computed numerically by Monte Carlo methods, as we illustrate in 
the empirical section. 

Let $\delta_n > 0$ be a finite sample expansion factor such that $\delta_n\log^{1/2} n \to 0$ but $\delta_n\log n \to \infty$. For example, we recommend setting $\delta_n = 1/(4\log^{3/4} n)$. Then for $c_n(1-\alpha) = k_n(1-\alpha) + \delta_n$ we define the confidence bands of asymptotic level $1-\alpha$ to be

$$[\dot\iota(w), \ddot\iota(w)] = [\hat\theta(w) - c_n(1-\alpha)\hat\sigma_n(w),\ \hat\theta(w) + c_n(1-\alpha)\hat\sigma_n(w)], \quad w \in \mathcal I.$$
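
The band construction can be sketched numerically as follows. This is only an illustration under stated assumptions: the supremum over $\mathcal I$ is approximated on a finite grid, the coupling is simulated in the plug-in form $\ell(w)'\hat\Omega^{1/2}\mathcal N_k/(\sqrt{n}\,\hat\sigma_n(w))$ with $\mathcal N_k \sim N(0, I_k)$ (the precise form of the coupling should be taken from the paper's definition (5.16)), and the cosine basis, simulated data and variable names are hypothetical.

```python
import numpy as np

def cosine_basis(x, k):
    """Hypothetical basis on [0, 1]: 1, sqrt(2)cos(pi j x), j = 1..k-1."""
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, np.arange(k)))
    B[:, 0] = 1.0
    return B

rng = np.random.default_rng(0)
n, k, alpha = 2000, 6, 0.05
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=n)

P = cosine_basis(x, k)
Q_hat = P.T @ P / n
beta_hat = np.linalg.solve(Q_hat, P.T @ y / n)
eps_hat = y - P @ beta_hat
Sigma_hat = (P * eps_hat[:, None] ** 2).T @ P / n
Q_inv = np.linalg.inv(Q_hat)
Omega_hat = Q_inv @ Sigma_hat @ Q_inv

grid = np.linspace(0.0, 1.0, 201)                  # finite grid standing in for the index set
L = cosine_basis(grid, k)                          # rows are the loadings ell(w)'
theta_hat = L @ beta_hat
sigma_hat = np.sqrt(np.einsum("ij,jk,ik->i", L, Omega_hat, L) / n)

root = np.linalg.cholesky(Omega_hat)               # root @ root.T == Omega_hat (one valid square root)
draws = rng.standard_normal((5000, k))             # Monte Carlo draws of N_k ~ N(0, I_k)
t_star = (draws @ root.T @ L.T) / np.sqrt(n) / sigma_hat

k_n = np.quantile(np.abs(t_star).max(axis=1), 1 - alpha)   # simulated (1 - alpha) quantile of the sup
delta_n = 1.0 / (4.0 * np.log(n) ** 0.75)                  # recommended expansion factor
c_n = k_n + delta_n
lower, upper = theta_hat - c_n * sigma_hat, theta_hat + c_n * sigma_hat
print(k_n, delta_n, c_n)
```

By construction each simulated $t_n^*(w)$ has unit variance, so the critical value for the supremum is never smaller than the pointwise normal critical value, up to simulation error.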

The following theorem establishes the asymptotic validity of these confidence bands. The 
last result relies on the additional property of anti-concentration. The anti-concentration 
property holds if, after appropriate scaling by some deterministic sequences $a_n$ and $b_n$, the inferential statistic $a_n(\|t_n\|_{\mathcal I} - b_n)$ has a continuous limit distribution. More generally, it holds if for any subsequence of integers $\{n_k\}$ there is a further subsequence $\{n_{k_r}\}$ along which $a_{n_{k_r}}(\|t_{n_{k_r}}\|_{\mathcal I} - b_{n_{k_r}})$ has a continuous limit distribution, possibly dependent on the subsequence. We expect anti-concentration to hold in our case, but our constructions and
results do not critically hinge on it. 

Theorem 11 (Uniform Inference for Linear Functionals) . Suppose that the conditions of 
Theorem [121 hold. 
(1) Then

$$P\{\|t_n\|_{\mathcal I} \le c_n(1-\alpha)\} \ge 1 - \alpha + o(1). \tag{5.17}$$

(2) As a consequence, the confidence bands constructed above cover $\theta(w)$ uniformly for all $w \in \mathcal I$ with probability that is asymptotically no less than $1-\alpha$, namely

$$P\{\theta(w) \in [\dot\iota(w), \ddot\iota(w)], \text{ for all } w \in \mathcal I\} \ge 1 - \alpha + o(1). \tag{5.18}$$

(3) The width of the confidence band $2c_n(1-\alpha)\hat\sigma_n(w)$ obeys uniformly in $w \in \mathcal I$:

$$2c_n(1-\alpha)\hat\sigma_n(w) = 2k_n(1-\alpha)(1 + o_P(1))\sigma_n(w). \tag{5.19}$$

(4) Furthermore, if $\|t_n^*\|_{\mathcal I}$ does not concentrate at $k_n(1-\alpha)$ at a rate faster than $\sqrt{\log n}$, that is, it obeys the anti-concentration property $P(\|t_n^*\|_{\mathcal I} \le k_n(1-\alpha) + \varepsilon_n) = 1 - \alpha + o(1)$ for any $\varepsilon_n = o(1/\sqrt{\log n})$, then the inequalities in (5.17) and (5.18) hold as equalities, and the finite sample adjustment factor $\delta_n$ could be set to zero.

Theorem 11 shows that the confidence bands constructed above maintain the required level asymptotically and establishes that the uniform width of the bands is of the same order as the uniform rate of convergence. Moreover, confidence intervals are asymptotically similar under anti-concentration.

A similar strategy was proposed in Chernozhukov et al. (2009) for inference on the minimum of a function. Since the limit distribution may not exist, the insight was to use distributions provided by couplings. Because the limit distribution does not necessarily exist, it is not immediately clear that the confidence intervals maintain the right asymptotic level. However, the additional adjustment factor $\delta_n$ assures the right asymptotic level. A potential downside of using the adjustment $\delta_n$ is that the confidence intervals may not be similar, i.e. they may remain asymptotically conservative in coverage. However, the width of the confidence intervals is not asymptotically conservative, since $\delta_n$ is negligible compared to $k_n(1-\alpha)$. Nonetheless, if the anti-concentration property holds, then the confidence intervals automatically become asymptotically similar.

6. Tools: Maximal Inequalities for Matrices and Empirical Processes 

In this section we collect the main technical tools that our analysis relies upon, namely Khinchin inequalities for matrices and data-dependent maximal inequalities.

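6.1. Khinchin Inequalities for Matrices. Consider the Schatten norm $S_p$ on symmetric $k \times k$ matrices $Q$, defined as

$$\|Q\|_{S_p} = \Big(\sum_{j=1}^k |\lambda_j(Q)|^p\Big)^{1/p},$$

where $\lambda_1(Q),\dots,\lambda_k(Q)$ are the eigenvalues of $Q$.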





The case $p = \infty$ recovers the operator norm $\|\cdot\|$ and $p = 2$ the Frobenius norm. It is obvious that for any $p \ge 1$

$$\|Q\| \le \|Q\|_{S_p} \le k^{1/p}\|Q\|.$$

Therefore, setting $p = \log k$, we get equivalence

$$\|Q\| \le \|Q\|_{S_{\log k}} \le e\|Q\|. \tag{6.20}$$
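
The norm equivalence can be checked numerically. The following is a minimal sketch; the helper `schatten_norm` and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def schatten_norm(Q, p):
    """Schatten p-norm of a symmetric matrix Q, computed from its eigenvalues."""
    eig = np.abs(np.linalg.eigvalsh(Q))
    if np.isinf(p):
        return eig.max()                      # operator norm
    return (eig ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(1)
k = 50
A = rng.normal(size=(k, k))
Q = (A + A.T) / 2.0                           # a random symmetric k x k matrix

op_norm = schatten_norm(Q, np.inf)
s_logk = schatten_norm(Q, np.log(k))          # p = log k, so k^{1/p} = e
frob = schatten_norm(Q, 2)

print(op_norm, s_logk, np.e * op_norm)        # op_norm <= s_logk <= e * op_norm
```

The choice $p = \log k$ works because $k^{1/\log k} = e$, so the dimension-dependent factor collapses to a constant.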



Lemma 3 (Khinchin Inequality for Matrices). For symmetric $k \times k$ matrices $Q_i$, $i = 1,\dots,n$, and $2 \le p < \infty$, and an i.i.d. sequence of Rademacher variables $\varepsilon_1,\dots,\varepsilon_n$ we have



ap 



(E„[Q 



2lW2 



S P 



< (EeWGnfaQm) <b P (E n [Qi]) 



1/p 



,2 n l/2 



Sp 



where 



bp < ^n/e] 1 / 2 ■ y/p. 
As a consequence of equivalence (6.20), if $k \ge e^2$ we have

$$E_\varepsilon\|\mathbb G_n[\varepsilon_i Q_i]\| \lesssim \sqrt{\log k}\,\|(E_n[Q_i^2])^{1/2}\|.$$

The notable feature of this inequality is the $\sqrt{\log k}$ factor instead of the $\sqrt{k}$ factor expected from the conventional maximal inequalities based on entropy. This inequality, due to Lust-Picard and Pisier (1991), generalizes the Khinchin inequality for vectors. A version of this inequality was derived by Guedon and Rudelson (2007) using generalized entropy (majorizing measure) arguments. This is another striking example where the use of generalized entropy yields drastic improvements over the use of entropy. Prior to this, Talagrand (1996a) provided ellipsoidal examples where the difference between the two approaches was even more extreme.
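
A small Monte Carlo experiment can make the $\sqrt{\log k}$ scaling concrete. The sketch below compares a simulated value of $E_\varepsilon\|\mathbb G_n[\varepsilon_i Q_i]\|$ with $\sqrt{\log k}\,\|(E_n[Q_i^2])^{1/2}\|$ for rank-one matrices $Q_i = p_i p_i'$. The data-generating choices and names are assumptions for illustration, and the reported ratio only indicates the size of the implied constant in this one design.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 400, 20                                   # k >= e^2, as the consequence of (6.20) requires
p = rng.uniform(-1.0, 1.0, size=(n, k))
Qs = np.einsum("ni,nj->nij", p, p)               # Q_i = p_i p_i', symmetric k x k

second_moment = np.einsum("nij,njl->il", Qs, Qs) / n            # E_n[Q_i^2]
rhs = np.sqrt(np.log(k)) * np.sqrt(np.linalg.eigvalsh(second_moment).max())

def op_norm_of_rademacher_sum():
    """Operator norm of G_n[eps_i Q_i] = n^{-1/2} sum_i eps_i Q_i for one Rademacher draw."""
    eps = rng.choice([-1.0, 1.0], size=n)
    G = np.einsum("n,nij->ij", eps, Qs) / np.sqrt(n)
    return np.linalg.norm(G, 2)

lhs = np.mean([op_norm_of_rademacher_sum() for _ in range(200)])  # Monte Carlo estimate of E_eps
ratio = lhs / rhs
print(lhs, rhs, ratio)
```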

6.2. LLN for Matrices. The following lemma is a variant of a fundamental result obtained by Rudelson (1999).

Lemma 4 (Matrix LLN). Let $Q_1,\dots,Q_n$ be i.n.i.d. symmetric non-negative $k \times k$ matrices with $k \ge e^2$ such that $Q = E_n[E[Q_i]]$ and $\|Q_i\| \le M$ a.s., then for $\hat Q = E_n[Q_i]$

$$\Delta := E\|\hat Q - Q\| \lesssim \sqrt{\frac{M(1 + \|Q\|)\log n}{n}}.$$

In particular, if $Q_i = p_i p_i'$, with $\|p_i\| \le \zeta_k$ a.s., then

$$\Delta := E\|\hat Q - Q\| \lesssim \sqrt{\frac{\zeta_k^2(1 + \|Q\|)\log n}{n}}.$$
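
The matrix LLN can be illustrated by simulation: for a bounded basis, the operator-norm error of $\hat Q$ shrinks as $n$ grows with $k$ fixed. The sketch below uses a hypothetical cosine basis that is orthonormal under the uniform design, so that $Q = I$; the reference rate is printed without any claim about the constant hidden in the lemma's bound.

```python
import numpy as np

def cosine_basis(x, k):
    """Hypothetical basis on [0, 1]: 1, sqrt(2)cos(pi j x), j = 1..k-1 (so Q = I under U[0, 1])."""
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, np.arange(k)))
    B[:, 0] = 1.0
    return B

def lln_error(n, k, rng):
    """Operator norm of Q_hat - I for one simulated sample of size n."""
    P = cosine_basis(rng.uniform(size=n), k)
    return np.linalg.norm(P.T @ P / n - np.eye(k), 2)

rng = np.random.default_rng(3)
k = 8
zeta_k = np.sqrt(2 * k - 1)                      # sup_x ||p(x)|| for this basis, attained at x = 0
sizes = [200, 2000, 20000]
errors = [np.mean([lln_error(n, k, rng) for _ in range(20)]) for n in sizes]
rates = [np.sqrt(zeta_k ** 2 * np.log(n) / n) for n in sizes]

for n, err, rate in zip(sizes, errors, rates):
    print(n, round(err, 4), round(rate, 4))      # average error next to sqrt(zeta_k^2 log n / n)
```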
6.3. Maximal Inequalities. Consider a function class $\mathcal F$ collecting functions mapping some set $Z$ to $\mathbb R$, equipped with an envelope function $F(z) \ge \sup_{f\in\mathcal F}|f(z)|$. The covering number $N(\varepsilon, \mathcal F, L^2(Q))$ is the minimal number of $L^2(Q)$-balls of radius $\varepsilon$ needed to cover the function set $\mathcal F$. The covering number relative to the envelope function is given by

$$N(\varepsilon\|F\|_{Q,2}, \mathcal F, L^2(Q)). \tag{6.21}$$

The entropy is the logarithm of the covering number.
We rely on the following result.

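Proposition 3. Let $(\epsilon_1, X_1),\dots,(\epsilon_n, X_n)$ be i.i.d. random vectors in $\mathbb R^{d+1}$ with $E[\epsilon_i \mid X_i] = 0$ and $\sigma^2 := \sup_x E[\epsilon_i^2 \mid X_i = x] < \infty$. Let $\mathcal F$ be a class of functions on $\mathbb R^d$ such that $E[f(X_i)^2] = 1$ (normalization) and $\|f\|_\infty \le b$ for all $f \in \mathcal F$. Let $\mathcal G := \{(\epsilon, x) \in \mathbb R^{d+1} \mapsto \epsilon f(x) : f \in \mathcal F\}$. Suppose that there exist constants $A \ge e^2$ and $V \ge 2$ such that

$$\sup_Q N(\mathcal G, L^2(Q), \varepsilon\|G\|_{L^2(Q)}) \le (A/\varepsilon)^V$$

for all $0 < \varepsilon \le 1$ for the envelope $G(\epsilon, x) := |\epsilon| b$. If for some $m > 2$, $E[|\epsilon_1|^m] < \infty$, then

$$E\Big[\sup_{f\in\mathcal F}\Big|\sum_{i=1}^n \epsilon_i f(X_i)\Big|\Big] \le C\Big[\big(\sigma + \sqrt{E[|\epsilon_1|^m]}\big)\sqrt{nV\log(Ab)} + Vb^{m/(m-2)}\log(Ab)\Big],$$

where $C$ is a universal constant.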



The proof is based on a truncation argument and maximal inequalities for uniformly bounded classes of functions developed in Gine and Koltchinskii (2006). We recall its version.

Theorem 12 (Gine and Koltchinskii (2006)). Let $\xi_1,\dots,\xi_n$ be i.i.d. random variables taking values in a measurable space $(S, \mathcal S)$ with common distribution $P$. Let $\mathcal F$ be a suitably measurable class of functions on $S$ with envelope $F$. Let $\sigma^2$ be a constant such that $\sup_{f\in\mathcal F}\mathrm{var}(f) \le \sigma^2 \le \|F\|^2_{L^2(P)}$. Suppose that there exist constants $A \ge e^2$ and $V \ge 2$ such that $\sup_Q N(\mathcal F, L^2(Q), \varepsilon\|F\|_{L^2(Q)}) \le (A/\varepsilon)^V$ for all $0 < \varepsilon \le 1$. Then,



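$$E\Big[\sup_{f\in\mathcal F}\Big|\sum_{i=1}^n \{f(\xi_i) - E[f(\xi_1)]\}\Big|\Big] \le C\Big[\sqrt{n\sigma^2 V\log\frac{A\|F\|_{L^2(P)}}{\sigma}} + V\|F\|_\infty\log\frac{A\|F\|_{L^2(P)}}{\sigma}\Big],$$

where $C$ is a universal constant.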
Appendix A. Proofs

A.1. Proofs of Sections H and 01

Proof of Proposition 1. For any $\gamma$, $\int(\gamma'p)^2\,dF = \int(\gamma'p)^2(dF/d\mu)\,d\mu$, so that if $dF/d\mu$ is bounded above and away from zero, the result follows since the basis is orthonormal under $\mu$. □

Proof of Proposition [H For a function / = X^>i Pj( x )j~ a £j £ •^*( Q! ) Af O^) := Sj>fc+i Pj( x )j a ^j 
and 

:= ^V^" 1/2) log MT Q+1/2 . 
Then, the statement of the lemma is equivalent to 



p(svv\A f (x)\<v k ) =l-o(l). 



Consider an e-net Af e for X, and for some L > 1 let := {/ G F : |Ay(x) — j4j(x')| < 
L||x — x'|| for all x,x' G A'}. Then 

P(su P:re ^ 1^/(^)1 > < P(f i U k ) + P{f G ?4,sup, eA/ - £ \A f (x)\ >v k - Le) 

< P(f i H k ) + We\ max^ejv; P(|A/(x)| > u fc - Le) 

Note that we can take \N e \ < (diam(Af)/e) rf and 

^[^(x) 2 ^] = E[(£ i >k+ir a PjWti) i \*] = '■>:, .,.;./ 2 "/'ji,<-Kj.<-: 

< £r 2a+1 su Pj > fc+1 p 2 (x). 

Thus, setting e = k~ a+1 ^ 2 /L and ^ := d log(diam(Af) / e) k~ a+l l 2 we have Le < v~k and 
since -A/ (2) ~ N(0, E[Af(x) 2 \x]) we have 

P (7 G W fcj sup > v k - Le) = o(l). 

Next, to bound P(f ^ %fc), note that / is L-Lipschitz if 

T,j>k+i{pj( x ) -Pj{x ! )}r a ij 



Z : = sup 

x,x'&X 



< L. 



\\x — x' II 

Since sup x6 _^ j>k-\-l \WPj( x )\\ — Mj, we have that for 5 G (0, 1) 

P(Z > Ej>k+i r a Mj V2 log(2j V*)) < P(3j > fc + 1 : l&l > V / 21og(2j 2 /<5)) 

Thus, the result follows by noting that 

log(diam(^)/e) = log(Ldiam(;r)£^ 1/2 ) < log A; 
provided we choose $\delta = o(1)$ so that $j^{-\alpha}M_j\sqrt{2\log(2j^2/\delta)} = o(1)$, which leads to

^ j- a M jy /2log(2p/5)<l. 

j>k+l 



A.2. Proofs of Section 3ZD 

Proof of Theorem [IJ We have that 

lb - 2lk 2 < lb - p'/3||f,2 + - p'0||f,2 < c fc + ||p'/3 - j/^l^a 
where under the normalization Q = E[p(x)p(x)'\ = I we have 



\\ P '/3-p'Mf,2 



1/2 



{p-P)'p{x)p(x)'{p-P)dF{x) 
To prove the result we need to show ||/3 — /3|| <p \Jkjn. We have 

By the Matrix LLN of Lemma HJ which is the critical step, we have that 

\\Q-Q\\ ^pOif ii^^O. 

n 

Therefore 



□ 



llQ-Xb^H < p ||E n [ Pi e,]|| < P JkJ^i 
since A m i n (Q) > 1/2 wp — > 1 and by of bounded 

i?[||E n [ Kei ]|| 2 ] = £[e 2 ^/n] = EWUpi/n] < E[p' lPi /n] = k/n. 
Moreover when \ 0, 

since A m i n (Q) > 1/2 wp — > 1. Moreover, since fj := p / i Q _1 ]En[Pj r 'i] is a sample projection of 
rj on pi, so that 

||Q- 1/2 E n b^]|| 2 = Enhfi] = E n f 2 < E n [r 2 ] < P £[r 2 ] < c 2 , 
by the Chebyshev inequality. 
Moreover when c& \ Cqo > 0, 

IIQ-XMII < HQ- 1 !! sup j'Enlpin) < P J- -4c fc , 

||7ll=l V " 

where the last inequality is by Step 2 in the proof of Lemma [2j □ 
A.3. Proofs of Section 1131



Proof of Lemma 1. Decompose

$$\sqrt{n}\,\alpha'(\hat\beta - \beta) = \alpha'\mathbb G_n[p_i(\epsilon_i + r_i)] + \alpha'[\hat Q^{-1} - I]\mathbb G_n[p_i(\epsilon_i + r_i)].$$



We divide the proof into three steps. Steps 1 and 2 establish the first linearization result. Step 3 provides the last result (a bound on $R_{2n}$).

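Step 1. Conditional on $X = [x_1,\dots,x_n]$ the term

$$\alpha'[\hat Q^{-1} - I]\mathbb G_n[p_i\epsilon_i]$$

has mean zero and variance bounded by $\alpha'[\hat Q^{-1} - I]\hat Q\bar\sigma^2[\hat Q^{-1} - I]\alpha$. Next, since the design is random, by the Matrix LLN (Lemma 4) we have

$$\alpha'[\hat Q^{-1} - I]\hat Q\bar\sigma^2[\hat Q^{-1} - I]\alpha \le \bar\sigma^2\|\hat Q\|\,\|\hat Q^{-1}\|^2\,\|\hat Q - I\|^2 \lesssim_P \bar\sigma^2\lambda_{\max}(\hat Q)\lambda_{\min}^{-2}(\hat Q)\frac{\zeta_k^2\log n}{n} \lesssim_P \frac{\zeta_k^2\log n}{n} = o(1),$$

since $\zeta_k^2\log n = o(n)$. We then conclude by the Chebyshev inequality that

$$\alpha'[\hat Q^{-1} - I]\mathbb G_n[p_i\epsilon_i] \lesssim_P \sqrt{\frac{\zeta_k^2\log n}{n}}.$$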



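Step 2. Under the random design, we have $E[p_i r_i] = 0$. Thus, by the Matrix LLN (Lemma 4), so that $\|\hat Q - I\| \lesssim_P (\zeta_k^2\log n/n)^{1/2}$, and Lemma 5 we have

$$|\alpha'(\hat Q^{-1} - I)\mathbb G_n[p_i r_i]| \le \|\hat Q^{-1} - I\|\sup_{\|\gamma\|=1}|\gamma'\mathbb G_n[p_i r_i]| \le \|\hat Q^{-1}\|\,\|\hat Q - I\|\sup_{\|\gamma\|=1}|\gamma'\mathbb G_n[p_i r_i]| \lesssim_P \sqrt{\frac{\zeta_k^2\log n}{n}}\,\ell_k c_k\sqrt{k},$$

since

$$E\Big[\sup_{\|\gamma\|=1}|\mathbb G_n[\gamma'p_i r_i]|\Big] \le \frac{1}{\sqrt n}E\Big[\Big(\sum_{j=1}^k\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)^2\Big)^{1/2}\Big] \le \frac{1}{\sqrt n}\Big(\sum_{j=1}^k E\Big[\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)^2\Big]\Big)^{1/2} \le \ell_k c_k\sqrt{E[\|p(x_i)\|^2]} \wedge \zeta_k c_k \le \ell_k c_k\sqrt{k}.$$

We used that $E[p_i r_i] = 0$ and that for any $\|\gamma\| = 1$

$$\Big|\sum_{i=1}^n \gamma'p(x_i)r(x_i)\Big| = \Big|\sum_{i=1}^n\sum_{j=1}^k \gamma_j p_j(x_i)r(x_i)\Big| = \Big|\sum_{j=1}^k \gamma_j\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)\Big| \le \Big(\sum_{j=1}^k\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)^2\Big)^{1/2}.$$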

(Note that here we cannot rely on the initial conditioning argument used in Step 1.)

Step 3. Under random design we have $E[p_i r_i] = 0$. Thus, the term

$$R_{2n} = \alpha'\mathbb G_n[p_i r_i]$$

has mean zero and variance

$$E[\alpha'p_i r_i]^2 \le E[\alpha'p_i]^2\ell_k^2c_k^2 \le \ell_k^2c_k^2.$$

Thus, using the Chebyshev inequality, this step gives the second linearization result. □

Proof of Theorem^ It suffices to prove the first result only. For any sequence a G S k ~ l , 
we can write 



where 



a' Pi , £k , 

Uni = n ,^-1 /on -7=) ^ni < -7=, e«+ r i < e d+4Cfc 
n>n > & Q



nE\\u ni \\ 2 < E[a'pi] 2 /a'na < 1/a 2 . 



(A.22) 



(A.23) 



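Next we verify the Lindeberg condition for the CLT. First, by construction we have

$$\mathrm{var}\Big(\sum_{i=1}^n \omega_{ni}(\epsilon_i + r_i)\Big) = 1.$$

Second, for each $\delta > 0$

$$\sum_{i=1}^n E\big[|\omega_{ni}|^2(\epsilon_i + r_i)^2 1\{|\omega_{ni}||\epsilon_i + r_i| > \delta\}\big] \to 0,$$

since the left hand side is bounded by

$$2nE\big[|\omega_{ni}|^2\epsilon_i^2 1\{|\epsilon_i| + \ell_k c_k > \delta/|\omega_{ni}|\}\big] + 2nE\big[|\omega_{ni}|^2\ell_k^2c_k^2 1\{|\epsilon_i| + \ell_k c_k > \delta/|\omega_{ni}|\}\big],$$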
and both terms go to zero. Indeed, for the first term we have 



nE 



\uru\rE 



; {N + c fc4 > W n /€k}\Xi 



< nE [\\uj ni \\ 2 ] ■ sup E e 2 {\ei\ + c k £ k > 5y/n/£ k }\xi = x 

< oT 2 o{l) = o(l) 



where we used (A.23), the uniform integrability in A.4 and $\delta\sqrt{n}/\zeta_k - c_k\ell_k \to \infty$; and for the second term



nE 



Uni\\ 2 £ 2 k c 2 k P |e;| + C k £ k > 5y/n/£ k \Xi 



< nE [\\u] ni \\ 2 £lcl] • supP |ej| + c k £ k > 5y / n/£ k \x i = x 



< <L- 2 tl4 ■ 



o(l) 



where we used (A.23), $\delta\sqrt{n}/\zeta_k - c_k\ell_k \to \infty$ and $c_k\ell_k = o(\delta\sqrt{n}/\zeta_k)$.



□ 
A.4. Proofs of Section 4

Proof of Lemma 2. Decompose

$$\sqrt{n}\,\alpha(x)'(\hat\beta - \beta) = \alpha(x)'\mathbb G_n[p_i(\epsilon_i + r_i)] + \alpha(x)'[\hat Q^{-1} - I]\mathbb G_n[p_i(\epsilon_i + r_i)].$$

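Step 1. Conditional on the data, let $T := \{t = (t_1,\dots,t_n) \in \mathbb R^n : t_i = \alpha(x)'(\hat Q^{-1} - I)p_i\epsilon_i,\ x \in \mathcal X\}$. Define the norm $\|\cdot\|_{n,2}$ on $\mathbb R^n$ by $\|t\|_{n,2}^2 = n^{-1}\sum_{i=1}^n t_i^2$. Letting $\eta_1,\dots,\eta_n$ be independent Rademacher random variables independent of the data, we have by Dudley's inequality (Dudley, 1967)

$$E_\eta\Big[\sup_{x\in\mathcal X}|\alpha(x)'[\hat Q^{-1} - I]\mathbb G_n[\eta_i p_i\epsilon_i]|\Big] \le C\int_0^\theta\sqrt{\log N(\varepsilon, T, \|\cdot\|_{n,2})}\,d\varepsilon,$$

where $\theta := 2\sup_{t\in T}\|t\|_{n,2} = 2\sup_{x\in\mathcal X}\|\alpha(x)'(\hat Q^{-1} - I)p_i\epsilon_i\|_{L^2(\mathbb P_n)} \le 2\max_{1\le i\le n}|\epsilon_i|\,\|\hat Q^{-1} - I\|\,\|\hat Q\|^{1/2}$. Since for any $x, x' \in \mathcal X$,

$$\|\alpha(x)'(\hat Q^{-1} - I)p_i\epsilon_i - \alpha(x')'(\hat Q^{-1} - I)p_i\epsilon_i\|_{L^2(\mathbb P_n)} \le \|\alpha(x) - \alpha(x')\| \cdot \big\|\|(\hat Q^{-1} - I)p_i\epsilon_i\|\big\|_{L^2(\mathbb P_n)} \le L_{1k}\zeta_k\max_{1\le i\le n}|\epsilon_i|\,\|\hat Q^{-1} - I\|\,\|x - x'\| =: L'_{1k}\max_{1\le i\le n}|\epsilon_i|\,\|\hat Q^{-1} - I\|\,\|x - x'\|,$$

we have

$$N(\varepsilon, T, \|\cdot\|_{n,2}) \le \Big(\frac{CL'_{1k}\max_{1\le i\le n}|\epsilon_i|\,\|\hat Q^{-1} - I\|}{\varepsilon}\Big)^d.$$

Thus we have

$$\int_0^\theta\sqrt{\log N(\varepsilon, T, \|\cdot\|_{n,2})}\,d\varepsilon \le \max_{1\le i\le n}|\epsilon_i|\,\|\hat Q^{-1} - I\|\int_0^{2\|\hat Q\|^{1/2}}\sqrt{d\log(CL'_{1k}/\varepsilon)}\,d\varepsilon.$$

By A.5 we have $E[\max_{1\le i\le n}|\epsilon_i| \mid X] \lesssim_P n^{1/m}$. Since $\|\hat Q^{-1} - I\| \lesssim_P \sqrt{\zeta_k^2\log n/n}$, $\|\hat Q\| \lesssim_P 1$ and $\log L'_{1k} \lesssim \log n$, we have

$$E\Big[\sup_{x\in\mathcal X}|\alpha(x)'[\hat Q^{-1} - I]\mathbb G_n[p_i\epsilon_i]| \,\Big|\, X\Big] \le 2E\Big[E_\eta\Big[\sup_{x\in\mathcal X}|\alpha(x)'[\hat Q^{-1} - I]\mathbb G_n[\eta_i p_i\epsilon_i]|\Big] \,\Big|\, X\Big] \lesssim_P n^{1/m}\sqrt{\frac{\zeta_k^2\log^2 n}{n}},$$

where the first inequality is due to the symmetrization inequality. Thus, we have

$$\sup_{x\in\mathcal X}|\alpha(x)'[\hat Q^{-1} - I]\mathbb G_n[p_i\epsilon_i]| \lesssim_P n^{1/m}\sqrt{\frac{\zeta_k^2\log^2 n}{n}}.$$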
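Step 2. Observe that

$$\sup_{x\in\mathcal X}|\alpha(x)'(\hat Q^{-1} - I)\mathbb G_n[p_i r_i]| \le \|\hat Q^{-1} - I\|\sup_{\|\gamma\|=1}|\mathbb G_n[\gamma'p_i r_i]|.$$

We wish to bound $\sup_{\|\gamma\|=1}|\mathbb G_n[\gamma'p_i r_i]|$. Recall that $E[p_i r_i] = 0$. For any $\|\gamma\| = 1$,

$$\Big|\sum_{i=1}^n \gamma'p(x_i)r(x_i)\Big| = \Big|\sum_{i=1}^n\sum_{j=1}^k \gamma_j p_j(x_i)r(x_i)\Big| = \Big|\sum_{j=1}^k \gamma_j\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)\Big| \le \Big(\sum_{j=1}^k\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)^2\Big)^{1/2}.$$

Taking expectation, we have

$$E\Big[\sup_{\|\gamma\|=1}|\mathbb G_n[\gamma'p_i r_i]|\Big] \le \frac{1}{\sqrt n}E\Big[\Big(\sum_{j=1}^k\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)^2\Big)^{1/2}\Big] \le \frac{1}{\sqrt n}\Big(\sum_{j=1}^k E\Big[\Big(\sum_{i=1}^n p_j(x_i)r(x_i)\Big)^2\Big]\Big)^{1/2} \le \ell_k c_k\sqrt{E[\|p(x_i)\|^2]} \wedge \zeta_k c_k \le \ell_k c_k\sqrt{k}.$$

Thus, we have

$$\sup_{x\in\mathcal X}|\alpha(x)'(\hat Q^{-1} - I)\mathbb G_n[p_i r_i]| \lesssim_P \sqrt{\frac{\zeta_k^2\log n}{n}}\,\ell_k c_k\sqrt{k}.$$

Steps 1 and 2 give the first linearization result.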

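Step 3. We wish to bound $\sup_{x\in\mathcal X}|\alpha(x)'\mathbb G_n[p_i r_i]|$. We use Theorem 12. Consider the class of functions

$$\mathcal F := \{\alpha(x)'p(\cdot)r(\cdot) : x \in \mathcal X\}.$$

Then, $|\alpha(x)'p(\cdot)r(\cdot)| \le \ell_k c_k\zeta_k$ and for any $x, \tilde x \in \mathcal X$,

$$|\alpha(x)'p(\cdot)r(\cdot) - \alpha(\tilde x)'p(\cdot)r(\cdot)| \le \ell_k c_k L_{1k}\zeta_k\|x - \tilde x\|,$$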
so that



sup N(T,L 2 (Q),s£ k c k £ k ) < 
Q 



e 



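Thus, by Theorem 12 we have

$$E\Big[\sup_{x\in\mathcal X}|\alpha(x)'\mathbb G_n[p_i r_i]|\Big] \lesssim \ell_k c_k\sqrt{\log n} + \ell_k c_k\frac{\zeta_k\log n}{\sqrt n} \lesssim \ell_k c_k\sqrt{\log n},$$

where we have used the fact that

$$\frac{\zeta_k\log n}{\sqrt n} = \sqrt{\log n}\,\sqrt{\frac{\zeta_k^2\log n}{n}} = o(\sqrt{\log n}).$$

Therefore, we have

$$\sup_{x\in\mathcal X}|\alpha(x)'\mathbb G_n[p_i r_i]| \lesssim_P \sqrt{\log n}\,\ell_k c_k. \tag{A.24}$$

This completes the proof. □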

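Proof of Theorem 3. We wish to bound $\sup_{x\in\mathcal X}|\alpha(x)'\mathbb G_n[p_i\epsilon_i]|$. To this end, we use Proposition 3. Consider the class of functions

$$\mathcal G := \{(\epsilon, x) \mapsto \epsilon\,\alpha(v)'p(x) : v \in \mathcal X\}.$$

Then, $|\alpha(v)'p(x_i)| \le \zeta_k$, $\mathrm{var}(\alpha(v)'p(x_i)) = 1$ and for any $v, \tilde v \in \mathcal X$,

$$|\epsilon\,\alpha(v)'p(x) - \epsilon\,\alpha(\tilde v)'p(x)| \le |\epsilon| L_{1k}\zeta_k\|v - \tilde v\|.$$

Thus, taking $G(\epsilon, x) := |\epsilon|\zeta_k$, we have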

S upN(g,L 2 (Q),e\\G\\ L 2 {Q) )< (< ^ 



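Therefore, by Proposition 3 we have

$$E\Big[\sup_{x\in\mathcal X}|\alpha(x)'\mathbb G_n[p_i\epsilon_i]|\Big] \lesssim \sqrt{\log n} + \frac{\zeta_k^{m/(m-2)}\log n}{\sqrt n} \lesssim \sqrt{\log n}, \tag{A.25}$$

where we have used assumption A.5:

$$\frac{\zeta_k^{m/(m-2)}\log n}{\sqrt n} = \sqrt{\log n}\cdot\sqrt{\frac{\zeta_k^{2m/(m-2)}\log n}{n}} \lesssim \sqrt{\log n}.$$

This completes the proof. □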
Proof of Theorem 4. The proof follows similarly to that in Chernozhukov et al. (2009) and has two steps: in the first, we couple the estimator $\sqrt{n}(\hat\beta - \beta)$ with the normal vector; in the second, we establish the strong approximation for the series estimate of the function.

Step 1. We shall apply Yurinskii's coupling (see Theorem 10 in Pollard (2002)):

Let $\zeta_1,\dots,\zeta_n$ be independent $k$-vectors with $E[\zeta_i] = 0$ for each $i$, and $\Delta := \sum_i E\|\zeta_i\|^3$ finite. Let $S$ denote a copy of $\zeta_1 + \dots + \zeta_n$ on a sufficiently rich probability space $(\Omega, \mathcal A, P)$. For each $\delta > 0$ there exists a random vector $T$ in this space with a $N(0, \mathrm{var}(S))$ distribution such that

$$P\{\|S - T\| > 3\delta\} \le C_0 B\Big(1 + \frac{|\log(1/B)|}{k}\Big) \quad \text{where } B := \Delta k\delta^{-3},$$

for some universal constant $C_0$.

In order to apply the coupling, consider a copy of the first order approximation to our 
estimator on a suitably rich probability space 



1 n 

In 

i=l 



N(0,I k ), 



Then since the maximal eigenvalue of $\Omega^{-1/2}$ is bounded by an earlier argument,

$$E\|\zeta_i\|^3 \lesssim E\|p_i(\epsilon_i + r_i)\|^3 \lesssim E[\|p_i\|^3](1 + \ell_k^3c_k^3) \lesssim E[\|p_i\|^2]\zeta_k(1 + \ell_k^3c_k^3) \lesssim k\zeta_k(1 + \ell_k^3c_k^3),$$

where we used the assumption that $E[|\epsilon_i|^3 \mid x_i]$ are uniformly bounded. Therefore, by Yurinskii's coupling, for each $\delta > 0$



> 35a 



-i 



< 



< 



nk\ k {l + 44) f | log(fc 2 g fc (l + ^ c 3 )) 



((ki^V™) 3 
a 3 fc 2 g fc (l+^ c 3) 

(5nV2) 



1 + 



log n 



by a w fc 4 e|(l + ^c 3 ) 2 log 2 n ^ Qj 



n 
Finally, combining the preceding step with the assumption on the linearization error $R_{1n}$, we obtain that a copy of $\sqrt{n}(\hat\beta - \beta)$ on a suitably rich probability space obeys

$$\Big\|\sqrt{n}\,\Omega^{-1/2}(\hat\beta - \beta) - \mathcal N_k\Big\| \le \Big\|\frac{1}{\sqrt n}\sum_{i=1}^n \zeta_i - \mathcal N_k\Big\| + \Big\|\sqrt{n}\,\Omega^{-1/2}(\hat\beta - \beta) - \frac{1}{\sqrt n}\sum_{i=1}^n \zeta_i\Big\| \le o_P(a_n^{-1}) + R_{1n} = o_P(a_n^{-1}).$$

This proves the first part of the theorem.
Step 2. Using the result of Step 1 and that 



we conclude that 



satisfies 



\S n (x)\ ■ 



Sn(x)\\ \\Sn(x)\\ 



\Sn{x) 



\s n {x) 



sup|5 n (x)| < sfrSl- l ' 2 (p-p)-N k =o P (a~ 1 ), 

x£X 



(A.26) 



Finally, 



sup 

x£X 



Vn(g(x) - g(x)) s n (x)'M k 



< sup 

xex 



|| (a?) || ||SnO)|| 



+ sup 



\\ s n{x)\\ \\Sn(x)\\ 

^is n {x)'n- l l 2 0- p) s n {x)'M k 



\\s n {x)\\ \\Sn(x)\ 

= sup \^/nr(x)/\\s n (x)\\\ + sup \S n (x)\ = o P {a~ 1 ) + op(a~ 1 ), 
xex xgx 

using the assumption on the approximation error $r(x) = g(x) - p(x)'\beta$ and the bound (A.26). □

Proof of Theorem 5. Note that $\hat\beta^b$ solves the least squares problem for the rescaled data $\{(\sqrt{h_i}y_i, \sqrt{h_i}p_i) : 1 \le i \le n\}$. The weight $h_i$ is independent of $(y_i, p_i)$, $E[h_i] = 1$, $\mathrm{var}(h_i) = 1$, and $\max_{1\le i\le n} h_i \lesssim_P \log n$. That allows us to extend all results from $\hat\beta$ to $\hat\beta^b$, replacing $\zeta_k$ by $\zeta_k^b = \zeta_k\log n$ to account for the larger envelope, and $p_i$ by $p_i^b = h_i p_i$.
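
The rescaling argument can be made concrete with a short sketch of one weighted bootstrap draw, assuming standard exponential weights as in the second part of the proof. The cosine basis, the simulated data and the function name `weighted_bootstrap_draw` are hypothetical, introduced only for illustration.

```python
import numpy as np

def cosine_basis(x, k):
    """Hypothetical basis on [0, 1]: 1, sqrt(2)cos(pi j x), j = 1..k-1."""
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, np.arange(k)))
    B[:, 0] = 1.0
    return B

def weighted_bootstrap_draw(P, y, rng):
    """One draw of beta^b: least squares on the rescaled data (sqrt(h_i) y_i, sqrt(h_i) p_i)."""
    h = rng.exponential(scale=1.0, size=P.shape[0])     # weights with mean 1 and variance 1
    root_h = np.sqrt(h)
    beta_b = np.linalg.lstsq(root_h[:, None] * P, root_h * y, rcond=None)[0]
    return beta_b, h

rng = np.random.default_rng(4)
n, k = 500, 5
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=n)
P = cosine_basis(x, k)
beta_hat = np.linalg.lstsq(P, y, rcond=None)[0]

# The rescaled least squares problem coincides with the weighted normal equations.
beta_b, h = weighted_bootstrap_draw(P, y, rng)
beta_b_normal_eq = np.linalg.solve(P.T @ (h[:, None] * P), P.T @ (h * y))

# Repeated draws approximate the law of sqrt(n) (beta^b - beta_hat) given the data.
draws = np.array([weighted_bootstrap_draw(P, y, rng)[0] for _ in range(300)])
boot_sd = np.sqrt(n) * (draws - beta_hat).std(axis=0)
print(boot_sd)
```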

We apply Lemma 2 to the original problem and to the problem weighted by $\{h_i\}$. Then

$$\sqrt{n}(\hat\beta^b - \hat\beta) = \sqrt{n}(\hat\beta^b - \beta) + \sqrt{n}(\beta - \hat\beta) = \mathbb G_n[(h_i - 1)p_i(\epsilon_i + r_i)] + r_n,$$
where ||r n || < P \ A fcl ° g n (n 1 / m y/logn + y/k\ognl k c k ).

Note also that the results continue to hold in $P$-probability if we replace $P$ by $P^*$, since if a random variable $B_n \lesssim_P 1$ then $B_n \lesssim_{P^*} 1$. Indeed, the first relation means that $P(|B_n| > \ell_n) = o(1)$ for any $\ell_n \to \infty$, while the second means that $P^*(|B_n| > \ell_n) = o_P(1)$ for any $\ell_n \to \infty$. But the second clearly follows from the first by the Markov inequality, observing that $E[P^*(|B_n| > \ell_n)] = P(|B_n| > \ell_n) = o(1)$.

The second part of the theorem follows similarly to Theorem 4 by applying Yurinskii coupling for the weighted process $v_i = h_i - 1$, where $h_i \sim \exp(1)$ so that $E[v_i^2] = 1$,
Pflfij 3 ] < 1 and E[maxi<i< n \vi\] < logn. Thus there is a Gaussian process G n ~ iV(0, 
such that 



Q-l/2 n 
v 1=1 



P o(l/ logn). 



Combining the result above with the first part of the theorem, the second part follows by 
the triangle inequality. □ 

Proof of Theorem 6. Under A.1 and A.2, we have that $Q$ and $\Sigma$ have eigenvalues bounded away from zero and from above uniformly in $n$, which implies that so does $\Omega$.

Under our growth conditions, the first result follows from the Markov inequality and Lemma 4, which establishes $E[\|\hat Q - Q\|] \lesssim \|Q\|\sqrt{\zeta_k^2\log n/n} = o(1)$.

To establish the second result we note that

$$\hat\Sigma - \Sigma = E_n[(\hat\epsilon_i^2 - \{\epsilon_i + r_i\}^2)p_i p_i'] + E_n[\{\epsilon_i + r_i\}^2 p_i p_i'] - \Sigma. \tag{A.27}$$

The first term on the right hand side satisfies 

||E n [(e 2 - {e, + r t f) VlV %\ < \\^ n [{p'i0 ~ P)Y PiPM + 2||E n [(e i + n) V \0 ~ PWilW 
< sup \p'i(P- /3)| 2 ||E n [p;p-]|| + max{|e;| + \n\} sup {^0-/3)1 • ||E n [pip-]|| -)•? 0, 

<P \\Q — H («n + 4cfc) Q 7= >P 0, 

n 

since su P(86A . Wi@ ~ P)? & ^(V^gri + Rm + P 2 n) 2 /n by TheoremEl ||E n [p^]|| < P 1 by 
the first result, maxj< n |rj| < ^fcC&, and maxj< n |ej| 2 <p v 2 by Markov inequality. 
To control the last terms in (A.27), note that

(i) 



E[\\E n [{ei + rjW] - E||] < (1) 2EE e [\\E n [e l {e i + r^p^ 



< (2 ) 2 



<(3) 2 A /^e^[max 1 < l < n \ei + ri \(\\E n [{e t + rjWj H) 1 / 2 ] 



<(4) 2 A /^e fc (^[max 1 < i <„ |e, + r i | 2 ]) 1 /2 (£ ;[|| En[{e . + r,} l|]) 1/2 , 



where (1) holds by the Symmetrization Lemma (Lemma 2.3.6 in van der Vaart and Wellner (1996)), (2) by the Khinchin inequality, (3) by $\max_{1\le i\le n}\|p_i\| \le \zeta_k$ and (4) by Cauchy-Schwarz.

Since for any positive numbers $a$, $b$ and $R$, $a \le R(a+b)^{1/2}$ implies $a \le R^2 + R\sqrt{b}$, the expression above together with the triangle inequality yields

EiWE^et + nfp % p[\ - E||] < ^^(vl + 44) + (^^{vl + 44}) V2 l|£|| 1/2 . 

n \ n J 



The result follows from the Markov inequality. 
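
For completeness, the elementary implication invoked in this proof can be verified as follows. This short derivation is added as a sketch for the reader, with $a, b, R > 0$ as in the text.

```latex
\begin{align*}
a \le R(a+b)^{1/2}
  &\;\Longrightarrow\; a^2 - R^2 a - R^2 b \le 0 \\
  &\;\Longrightarrow\; a \le \frac{R^2 + \sqrt{R^4 + 4R^2 b}}{2} \\
  &\;\Longrightarrow\; a \le \frac{R^2 + R^2 + 2R\sqrt{b}}{2} = R^2 + R\sqrt{b},
\end{align*}
% The last step uses \sqrt{u+v} \le \sqrt{u} + \sqrt{v} with u = R^4 and v = 4R^2 b.
```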



A.5. Proofs of Section 5.1

Proof of Theorem 7. By Lemma 1 and P.1

\0(w) - 9(w)\ < \£(w)'0-(3)\ + \r n (w)\ 

< HM'^jp^i + mwm \R£ + \R2 n \) + ||^ H || rnW/ ^ 

< + li^l { \R ln] + 1^1] + fa(k,w)/y/n) 

where the last inequality follows by $\|\ell(w)\| \lesssim \zeta_\theta(k,w)$ assumed in P.2.

Next note that by Lemma [T] 



□ 



\Rln\ + \R2n\ <P \ — (1 + \/k\ognl k C k ) + i k C k = o(l). 

y n 

Finally, since $E[|\ell(w)'\mathbb G_n[p_i\epsilon_i]|^2] \lesssim \|\ell(w)\|^2\|Q\| \lesssim \zeta_\theta^2(k,w)$, the result follows by applying the Chebyshev inequality to establish that $|\ell(w)'\mathbb G_n[p_i\epsilon_i]| \lesssim_P \zeta_\theta(k,w)$.

□ 

Proof of Theorem 8. First note that by A.1 and A.2, $\Omega$ has eigenvalues bounded away from zero and from above. Moreover, under A.2, by Theorem 2 and P.1

* y . JM'N) | r n {w) _ £jwYn^Wk ( \\nv*t(w)\\ \ ( \\i{w)\\ 
To show that the last two terms are $o_P(1)$, note that by Theorem 6 $\hat\sigma_n(w) \gtrsim_P \|\Omega^{1/2}\ell(w)\|/\sqrt{n}$ since $\hat\sigma_n(w) = (1 + o_P(1))\sigma_n(w)$.

Also because $\hat\sigma_n(w) = (1 + o_P(1))\sigma_n(w)$, the first term satisfies



-> d N(0,l). 



□ 



A.6. Proofs of Section IQ1 

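Proof of Theorem 9. By the triangle inequality

$$\sup_{w\in\mathcal I}|\hat\theta(w) - \theta(w)| \le \sup_{w\in\mathcal I}|\ell(w)'(\hat\beta - \beta)| + \sup_{w\in\mathcal I}|r_n(w)|$$

where the second term satisfies $\sup_{w\in\mathcal I}|r_n(w)/\|\ell(w)\|| = o(n^{-1/2}\log^{-1} n)$ by condition U.1.

By Lemma 2 the first term is bounded uniformly over $\mathcal I$ by

$$|\ell(w)'(\hat\beta - \beta)| \lesssim_P \frac{|\ell(w)'\mathbb G_n[p_i(\epsilon_i + r_i)]|}{\sqrt n} + o_P\big(\zeta_\theta(k,\mathcal I)/[\sqrt{n}\log n]\big) \tag{A.28}$$

since $\|\ell(w)\| \lesssim \zeta_\theta(k,\mathcal I)$ by U.2 and the remainder term of the linear representation in Lemma 2 satisfies $\sup_{w\in\mathcal I}\|r_n(w)\| = o_P(1/\log n)$.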

The result follows as in the proofs of Lemma 2 (A.24) and Theorem 3 (A.25) to establish

£(w)' 



su Pl nr.i/2^ ^i G n\pi(ei + n)]\ < P y/\ogn + ylog nt k c k . 



□ 



Proof of Theorem 10. Under our conditions we have $\|\hat\Omega - \Omega\| = o_P(1/(\sqrt{k}\log^{3/2} n))$ and $\hat\sigma_n(w) = (1 + o_P(1/(\sqrt{k}\log^{3/2} n)))\sigma_n(w)$ uniformly in $w \in \mathcal I$. Then the result follows from Theorem 4 to obtain the Gaussian approximation. □

Proof of Theorem] 11\ Let e n = 1/logn, and 5 n such that ^log 1 / 2 n — > 0, and 5 n /e n — > oo. 

Let = ^^^-/v 7 "^ w £ I. Under our conditions we have D,\\ = op (l/Vk log 3 / 2 

and a n {w) = (1 + Op(l/y/k log 3 / 2 n))a n (w) uniformly in w G I. Then it follows that 
\\t* n \\i = su P«;ex 1^(^)1' which does not depend on the data V n , is such that 

p(\ iii;iix-iit;iixi< £ „) = i- (i). 

Now let k n (l — a) := (1 — a) — quantile of ||^||x, conditional on V n , and let K n {l — a) := 
(1 — a) — quantile of conditional on V n . 
Then applying Lemma 6 to $\|\hat t_n^*\|_{\mathcal I}$ and $\|t_n^*\|_{\mathcal I}$ we get that for some $\nu_n \searrow 0$

P[K n (p) > k n (p - v n ) - e n and k n (p) > n n (p - v n ) - e n ] = 1 - o(l). 

Claim (1) now follows by noting that 

P{\\t n \\ x > k n {l - a) + 8 n } < P{\\t n \\ x > n n {\-a-v n )-e n + 5 n } + o{\) 

< P{\\t*Jx > K n (l - a - u n ) - 2e n + 5 n } + o(l) 

< P{\K\\ X > k n (l -a- 2u n ) - 3e n + 6 n } + o(l) 

< P{\\t* n \\x>k n (l-a-2u n )} + o(l) 

= E P [P{\\t n \\x > Ml -a- 2u n )\D n }] + o(l) 

< Ep[u + 2v n ]+o(l) = a + o{l). 



Claim (2) follows from the equivalence of the event $\{\theta(w) \in [\dot\iota(w), \ddot\iota(w)], \text{ for all } w \in \mathcal I\}$ and the event $\{\|t_n\|_{\mathcal I} \le c_n(1-\alpha)\}$.

To prove Claim (3) note that $\hat\sigma_n(w) = (1 + o_P(1))\sigma_n(w)$ uniformly in $w \in \mathcal I$ under our conditions by Theorem 6. Moreover, $c_n(1-\alpha) = k_n(1-\alpha)(1 + o_P(1))$ because $1/k_n(1-\alpha) \lesssim_P 1$ and $\delta_n \to 0$. Combining these relations the result follows.

Claim (4) follows from Claim (1) and from the following lower bound.

By Lemma [U we get that for some v n \ 

P[K n (p + v n ) + e n > k n {p) and k n (p + v n ) + e n > K n (p)\ = 1 - o(l). 

Then 

P{\\tn\\l > k n (l ~ a) + 5 n } > P{ \\t n \\ X > K n (l - a + V n ) + £„ + S n } ~ o(l) 

> P{\K\\l > ^n(l -a + V n ) + 2s n + S n } - o(l) 

> P{\K\\x > k n (l - a + 2^ n ) + 3e n + 5 n } - o(l) 

> P{\K\\i > fcn(l - a + 2z. n ) + 25 n ] - o(l) 

> £[P{||C||z > fc n (l " a + 2i/ n ) + 25 n \V n }\ - o(l) 
= a — 2f n — o(l) = a — o(l), 



where we used the anti-concentration property in the last step. 



□ 
A.7. Proofs of Section 6

Proof of Lemma 4. Using the Symmetrization Lemma (Lemma 2.3.6 in van der Vaart and Wellner (1996)) and the Khinchin inequality, bound



Since 



and 



one has 



A := E\\Q - Q\\ < 2EE £ \\E n [e i Q i )\\ < ^ ^£||(E n Qf )V 2 | 

E\\{E n Q 2 fl 2 \\ = E\\(E n Q*)\\ 1/2 < 
W^nQiW < A+HQII, 



ME\\E n Qi 



1/2 



a<,/M^ [a + ||0||]^ 



n 



solving which for A gives the result stated in the lemma. 



□ 



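Proof of Proposition 3. For a $\tau > 0$ specified later, define $\epsilon_i^- := \epsilon_i 1(|\epsilon_i| \le \tau) - E[\epsilon_i 1(|\epsilon_i| \le \tau) \mid X_i]$ and $\epsilon_i^+ := \epsilon_i 1(|\epsilon_i| > \tau) - E[\epsilon_i 1(|\epsilon_i| > \tau) \mid X_i]$. Since $E[\epsilon_i \mid X_i] = 0$, $\epsilon_i = \epsilon_i^- + \epsilon_i^+$. Invoke the decomposition

$$\sum_{i=1}^n \epsilon_i f(X_i) = \sum_{i=1}^n \epsilon_i^- f(X_i) + \sum_{i=1}^n \epsilon_i^+ f(X_i).$$

We apply Theorem 12 to the first term. Noting that $\mathrm{var}(\epsilon_i^- f(X_i)) \le \sup_x E[(\epsilon_i^-)^2 \mid X_i = x]E[f(X_i)^2] \le \sup_x E[\epsilon_i^2 \mid X_i = x] = \sigma^2$ and $|\epsilon_i^- f(X_i)| \le 2\tau b$, we have

$$E\Big[\sup_{f\in\mathcal F}\Big|\sum_{i=1}^n \epsilon_i^- f(X_i)\Big|\Big] \le C\Big[\sqrt{n\sigma^2 V\log(Ab)} + V\tau b\log(Ab)\Big].$$

On the other hand, applying Theorem 2.14.1 of van der Vaart and Wellner (1996) to the second term, we obtain

$$E\Big[\sup_{f\in\mathcal F}\Big|\sum_{i=1}^n \epsilon_i^+ f(X_i)\Big|\Big] \le C\sqrt{n}\,b\sqrt{E[|\epsilon_1^+|^2]}\int_0^1\sqrt{V\log(A/\varepsilon)}\,d\varepsilon. \tag{A.29}$$

By assumption,

$$E[|\epsilon_1^+|^2] \le E[\epsilon_1^2 1(|\epsilon_1| > \tau)] \le \tau^{-m+2}E[|\epsilon_1|^m],$$

by which we have

$$\text{(A.29)} \le C\sqrt{E[|\epsilon_1|^m]}\,b\,\tau^{-m/2+1}\sqrt{nV\log(A)}.$$

Taking $\tau = b^{2/(m-2)}$, we obtain the desired inequality. □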
A.8. Additional technical results.

Lemma 5. Let $Z$ be a random vector in $\mathbb R^k$, $M$ be a $k \times k$ matrix and $\Gamma \subset \mathbb R^k \setminus \{0\}$. Then we have that

$$\sup_{\gamma\in\Gamma} E\Big[\Big|\frac{\gamma'}{\|\gamma\|}MZ\Big|^m\Big] \le \|M\|^m\sup_{\|\alpha\|=1}E[|\alpha'Z|^m].$$

Proof of Lemma 5. Let $\gamma$ achieve the supremum on the left hand side and set $a = \gamma/\|\gamma\|$. Then we have

$$E[|a'MZ|^m] = E[|(M'a)'Z|^m] = \|M'a\|^m E\Big[\Big|\frac{(M'a)'}{\|M'a\|}Z\Big|^m\Big] \le \|M'\|^m\|a\|^m E\Big[\Big|\frac{(M'a)'}{\|M'a\|}Z\Big|^m\Big] \le \|M\|^m\sup_{\|\alpha\|=1}E[|\alpha'Z|^m],$$

since $\|a\| = 1$ and $\|M'a/\|M'a\|\| = 1$. □

Lemma 6 (Closeness in Probability Implies Closeness of Conditional Quantiles). Let $X_n$ and $Y_n$ be random variables and $\mathcal D_n$ be a random vector. Let $F_{X_n}(x \mid \mathcal D_n)$ and $F_{Y_n}(x \mid \mathcal D_n)$ denote the conditional distribution functions, and $F_{X_n}^{-1}(p \mid \mathcal D_n)$ and $F_{Y_n}^{-1}(p \mid \mathcal D_n)$ denote the corresponding conditional quantile functions. If $|X_n - Y_n| = o_P(\varepsilon)$, then for some $\nu_n \searrow 0$, with probability converging to one

$$F_{X_n}^{-1}(p \mid \mathcal D_n) \le F_{Y_n}^{-1}(p + \nu_n \mid \mathcal D_n) + \varepsilon \quad\text{and}\quad F_{Y_n}^{-1}(p \mid \mathcal D_n) \le F_{X_n}^{-1}(p + \nu_n \mid \mathcal D_n) + \varepsilon, \quad \forall p \in (\nu_n, 1 - \nu_n).$$

Proof of Lemma 6. We have that for some $\nu_n \searrow 0$, $P\{|X_n - Y_n| > \varepsilon\} = o(\nu_n)$. This implies that $P[P\{|X_n - Y_n| > \varepsilon \mid \mathcal D_n\} \le \nu_n] \to 1$, i.e. there is a set $\Omega_n$ such that $P(\Omega_n) \to 1$ and $P\{|X_n - Y_n| > \varepsilon \mid \mathcal D_n\} \le \nu_n$ for all $\mathcal D_n \in \Omega_n$. So, for all $\mathcal D_n \in \Omega_n$

$$F_{X_n}(x \mid \mathcal D_n) \ge F_{Y_n+\varepsilon}(x \mid \mathcal D_n) - \nu_n \quad\text{and}\quad F_{Y_n}(x \mid \mathcal D_n) \ge F_{X_n+\varepsilon}(x \mid \mathcal D_n) - \nu_n, \quad \forall x \in \mathbb R,$$

which implies the inequality stated in the lemma, by definition of the conditional quantile function and equivariance of quantiles to location shifts. □